2014-10-13 13:12:13

by Prarit Bhargava

[permalink] [raw]
Subject: Locking issues with cpufreq and sysfs

There are several issues with the current locking design of cpufreq. Most
notably is the panic reported here:

http://marc.info/?l=linux-kernel&m=140622451625236&w=2

which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces
a race in the changing of the cpufreq policy. This change was introduced
because of a lockdep deadlock warning that can be reproduced (on x86 with
the acpi_cpufreq driver) via the following debug patch:

iff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index b0c18ed..366cfb7 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -885,6 +885,7 @@ static struct freq_attr *acpi_cpufreq_attr[] = {

static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
+ .flags = CPUFREQ_HAVE_GOVERNOR_PER_POLICY,
.target_index = acpi_cpufreq_target,
.bios_limit = acpi_processor_get_bios_limit,
.init = acpi_cpufreq_cpu_init,
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 61190f6..4cb488a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
/* end old governor */
if (old_gov) {
__cpufreq_governor(policy, CPUFREQ_GOV_STOP);
- up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
}

/* start new governor */
@@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
goto out;

- up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
}

/* new governor failed, so re-start old one */

(which causes the acpi-cpufreq driver to emulate the behaviour of the arm
cpufreq driver), and by doing

echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu5/cpufreq/ondemand/*
echo conservative > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
exit 0

[Question: was the original reported deadlock "real"? Did it really happen or
did lockdep only report it (I may have asked this question previously and
forgot the answer)? The reason I ask is that this situation is very similar to
USB's device removal in which the sysfs attributes are removed for a device but
not the device it was called for. I actually think that's part of the problem
here.]

The above, obviously, is a complete hack of the code but in a sense does
mimic a proper locking fix. However, even with this fix we are still left
with a race in accessing the sysfs files. Consider the following example,

CPU 1: accesses scaling_setspeed to set cpu speed

simultaneously,

CPU 2: accesses scaling_governor to set governor to ondemand

CPU 1 & 2 race ... and this can result in different critical situations.
The first is that CPU 1 holds the scalling_setspeed open while CPU attempts
to change the governor. This results in a syfs warning about creating a
file with an existing file name which in some cases can lead to additional
corruption and a panic. The second case is that CPU 1's setting of the speed
is now done on the new governor -- which may or may not be correct. In any
case an argument could be made that the userspace program doing this type
of action should be "smart" enough to confirm simultaneous changes... but
in any case the kernel should not panic or corrupt data.

The locking is insufficient here, Viresh. I no longer believe that fixes
to this locking scheme are the right way to move forward here. I'm wondering
if we can look at other alternatives such as maintaining a refcount or
perhaps using a queuing mechanism for governor and policy related changes.

P.


2014-10-13 13:23:05

by Prarit Bhargava

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs



On 10/13/2014 09:11 AM, Prarit Bhargava wrote:
> There are several issues with the current locking design of cpufreq. Most
> notably is the panic reported here:
>
> http://marc.info/?l=linux-kernel&m=140622451625236&w=2
>
> which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
> cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces
> a race in the changing of the cpufreq policy. This change was introduced
> because of a lockdep deadlock warning that can be reproduced (on x86 with
> the acpi_cpufreq driver) via the following debug patch:
>
> iff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> index b0c18ed..366cfb7 100644
> --- a/drivers/cpufreq/acpi-cpufreq.c
> +++ b/drivers/cpufreq/acpi-cpufreq.c
> @@ -885,6 +885,7 @@ static struct freq_attr *acpi_cpufreq_attr[] = {
>
> static struct cpufreq_driver acpi_cpufreq_driver = {
> .verify = cpufreq_generic_frequency_table_verify,
> + .flags = CPUFREQ_HAVE_GOVERNOR_PER_POLICY,
> .target_index = acpi_cpufreq_target,
> .bios_limit = acpi_processor_get_bios_limit,
> .init = acpi_cpufreq_cpu_init,
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 61190f6..4cb488a 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> /* end old governor */
> if (old_gov) {
> __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* start new governor */
> @@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
> goto out;
>
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* new governor failed, so re-start old one */
>
> (which causes the acpi-cpufreq driver to emulate the behaviour of the arm
> cpufreq driver), and by doing
>
> echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> cat /sys/devices/system/cpu/cpu5/cpufreq/ondemand/*
> echo conservative > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> exit 0
>
> [Question: was the original reported deadlock "real"? Did it really happen or
> did lockdep only report it (I may have asked this question previously and
> forgot the answer)? The reason I ask is that this situation is very similar to
> USB's device removal in which the sysfs attributes are removed for a device but
> not the device it was called for. I actually think that's part of the problem
> here.]
>
> The above, obviously, is a complete hack of the code but in a sense does
> mimic a proper locking fix. However, even with this fix we are still left
> with a race in accessing the sysfs files. Consider the following example,
>
> CPU 1: accesses scaling_setspeed to set cpu speed
>
> simultaneously,
>
> CPU 2: accesses scaling_governor to set governor to ondemand
>
> CPU 1 & 2 race ... and this can result in different critical situations.
> The first is that CPU 1 holds the scalling_setspeed open while CPU attempts
> to change the governor. This results in a syfs warning about creating a
> file with an existing file name which in some cases can lead to additional
> corruption and a panic. The second case is that CPU 1's setting of the speed
> is now done on the new governor -- which may or may not be correct. In any
> case an argument could be made that the userspace program doing this type
> of action should be "smart" enough to confirm simultaneous changes... but
> in any case the kernel should not panic or corrupt data.
>
> The locking is insufficient here, Viresh. I no longer believe that fixes
> to this locking scheme are the right way to move forward here. I'm wondering
> if we can look at other alternatives such as maintaining a refcount or
> perhaps using a queuing mechanism for governor and policy related changes.
>

Uh ... I meant this as "I'm willing to modify the code to do this but I'd like
to know what everyone else thinks before I do anything" ;)

P.

> P.
>

2014-10-13 14:49:32

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs

On Monday, October 13, 2014 09:22:49 AM Prarit Bhargava wrote:
>
> On 10/13/2014 09:11 AM, Prarit Bhargava wrote:
> > There are several issues with the current locking design of cpufreq. Most
> > notably is the panic reported here:
> >
> > http://marc.info/?l=linux-kernel&m=140622451625236&w=2
> >
> > which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
> > cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces
> > a race in the changing of the cpufreq policy. This change was introduced
> > because of a lockdep deadlock warning that can be reproduced (on x86 with
> > the acpi_cpufreq driver) via the following debug patch:
> >
> > iff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> > index b0c18ed..366cfb7 100644
> > --- a/drivers/cpufreq/acpi-cpufreq.c
> > +++ b/drivers/cpufreq/acpi-cpufreq.c
> > @@ -885,6 +885,7 @@ static struct freq_attr *acpi_cpufreq_attr[] = {
> >
> > static struct cpufreq_driver acpi_cpufreq_driver = {
> > .verify = cpufreq_generic_frequency_table_verify,
> > + .flags = CPUFREQ_HAVE_GOVERNOR_PER_POLICY,
> > .target_index = acpi_cpufreq_target,
> > .bios_limit = acpi_processor_get_bios_limit,
> > .init = acpi_cpufreq_cpu_init,
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 61190f6..4cb488a 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> > /* end old governor */
> > if (old_gov) {
> > __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> > - up_write(&policy->rwsem);
> > __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> > - down_write(&policy->rwsem);
> > }
> >
> > /* start new governor */
> > @@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> > if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
> > goto out;
> >
> > - up_write(&policy->rwsem);
> > __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> > - down_write(&policy->rwsem);
> > }
> >
> > /* new governor failed, so re-start old one */
> >
> > (which causes the acpi-cpufreq driver to emulate the behaviour of the arm
> > cpufreq driver), and by doing
> >
> > echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> > cat /sys/devices/system/cpu/cpu5/cpufreq/ondemand/*
> > echo conservative > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> > exit 0
> >
> > [Question: was the original reported deadlock "real"? Did it really happen or
> > did lockdep only report it (I may have asked this question previously and
> > forgot the answer)? The reason I ask is that this situation is very similar to
> > USB's device removal in which the sysfs attributes are removed for a device but
> > not the device it was called for. I actually think that's part of the problem
> > here.]
> >
> > The above, obviously, is a complete hack of the code but in a sense does
> > mimic a proper locking fix. However, even with this fix we are still left
> > with a race in accessing the sysfs files. Consider the following example,
> >
> > CPU 1: accesses scaling_setspeed to set cpu speed
> >
> > simultaneously,
> >
> > CPU 2: accesses scaling_governor to set governor to ondemand
> >
> > CPU 1 & 2 race ... and this can result in different critical situations.
> > The first is that CPU 1 holds the scalling_setspeed open while CPU attempts
> > to change the governor. This results in a syfs warning about creating a
> > file with an existing file name which in some cases can lead to additional
> > corruption and a panic. The second case is that CPU 1's setting of the speed
> > is now done on the new governor -- which may or may not be correct. In any
> > case an argument could be made that the userspace program doing this type
> > of action should be "smart" enough to confirm simultaneous changes... but
> > in any case the kernel should not panic or corrupt data.
> >
> > The locking is insufficient here, Viresh. I no longer believe that fixes
> > to this locking scheme are the right way to move forward here. I'm wondering
> > if we can look at other alternatives such as maintaining a refcount or
> > perhaps using a queuing mechanism for governor and policy related changes.
> >
>
> Uh ... I meant this as "I'm willing to modify the code to do this but I'd like
> to know what everyone else thinks before I do anything" ;)

OK, that's constructive. :-)

Can we discuss the target design first, please? You certainly have something
in mind, so can you describe it?

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-10-14 07:11:02

by Viresh Kumar

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs

On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
> There are several issues with the current locking design of cpufreq. Most

Sadly yes :(

> [Question: was the original reported deadlock "real"? Did it really happen or
> did lockdep only report it (I may have asked this question previously and
> forgot the answer)? The reason I ask is that this situation is very similar to
> USB's device removal in which the sysfs attributes are removed for a device but
> not the device it was called for. I actually think that's part of the problem
> here.]

I am still not sure about those lockdep warnings :(

> The above, obviously, is a complete hack of the code but in a sense does
> mimic a proper locking fix. However, even with this fix we are still left
> with a race in accessing the sysfs files. Consider the following example,
>
> CPU 1: accesses scaling_setspeed to set cpu speed
>
> simultaneously,
>
> CPU 2: accesses scaling_governor to set governor to ondemand
>
> CPU 1 & 2 race ... and this can result in different critical situations.
> The first is that CPU 1 holds the scalling_setspeed open while CPU attempts
> to change the governor. This results in a syfs warning about creating a
> file with an existing file name which in some cases can lead to additional
> corruption and a panic. The second case is that CPU 1's setting of the speed
> is now done on the new governor -- which may or may not be correct. In any
> case an argument could be made that the userspace program doing this type
> of action should be "smart" enough to confirm simultaneous changes... but
> in any case the kernel should not panic or corrupt data.
>
> The locking is insufficient here, Viresh. I no longer believe that fixes
> to this locking scheme are the right way to move forward here. I'm wondering
> if we can look at other alternatives such as maintaining a refcount or
> perhaps using a queuing mechanism for governor and policy related changes.

Probably this is similar to what I have been trying, i.e. to give access to only
a single thread to call __cpufreq_governor().

Can you please try the cpufreq/governor-fixes-v4 branch once ?

On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
> There are several issues with the current locking design of cpufreq. Most
> notably is the panic reported here:
>
> http://marc.info/?l=linux-kernel&m=140622451625236&w=2
>
> which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
> cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces
> a race in the changing of the cpufreq policy. This change was introduced
> because of a lockdep deadlock warning that can be reproduced (on x86 with
> the acpi_cpufreq driver) via the following debug patch:
>
> iff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> index b0c18ed..366cfb7 100644
> --- a/drivers/cpufreq/acpi-cpufreq.c
> +++ b/drivers/cpufreq/acpi-cpufreq.c
> @@ -885,6 +885,7 @@ static struct freq_attr *acpi_cpufreq_attr[] = {
>
> static struct cpufreq_driver acpi_cpufreq_driver = {
> .verify = cpufreq_generic_frequency_table_verify,
> + .flags = CPUFREQ_HAVE_GOVERNOR_PER_POLICY,
> .target_index = acpi_cpufreq_target,
> .bios_limit = acpi_processor_get_bios_limit,
> .init = acpi_cpufreq_cpu_init,
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 61190f6..4cb488a 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> /* end old governor */
> if (old_gov) {
> __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* start new governor */
> @@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *polic
> if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
> goto out;
>
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* new governor failed, so re-start old one */
>
> (which causes the acpi-cpufreq driver to emulate the behaviour of the arm
> cpufreq driver), and by doing
>
> echo ondemand > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> cat /sys/devices/system/cpu/cpu5/cpufreq/ondemand/*
> echo conservative > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> exit 0
>
> [Question: was the original reported deadlock "real"? Did it really happen or
> did lockdep only report it (I may have asked this question previously and
> forgot the answer)? The reason I ask is that this situation is very similar to
> USB's device removal in which the sysfs attributes are removed for a device but
> not the device it was called for. I actually think that's part of the problem
> here.]
>
> The above, obviously, is a complete hack of the code but in a sense does
> mimic a proper locking fix. However, even with this fix we are still left
> with a race in accessing the sysfs files. Consider the following example,
>
> CPU 1: accesses scaling_setspeed to set cpu speed
>
> simultaneously,
>
> CPU 2: accesses scaling_governor to set governor to ondemand
>
> CPU 1 & 2 race ... and this can result in different critical situations.
> The first is that CPU 1 holds the scalling_setspeed open while CPU attempts
> to change the governor. This results in a syfs warning about creating a
> file with an existing file name which in some cases can lead to additional
> corruption and a panic. The second case is that CPU 1's setting of the speed
> is now done on the new governor -- which may or may not be correct. In any
> case an argument could be made that the userspace program doing this type
> of action should be "smart" enough to confirm simultaneous changes... but
> in any case the kernel should not panic or corrupt data.
>
> The locking is insufficient here, Viresh. I no longer believe that fixes
> to this locking scheme are the right way to move forward here. I'm wondering
> if we can look at other alternatives such as maintaining a refcount or
> perhaps using a queuing mechanism for governor and policy related changes.
>
> P.

2014-10-14 18:24:42

by Prarit Bhargava

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs



On 10/14/2014 03:10 AM, Viresh Kumar wrote:
> On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
>>
>> The locking is insufficient here, Viresh. I no longer believe that fixes
>> to this locking scheme are the right way to move forward here. I'm wondering
>> if we can look at other alternatives such as maintaining a refcount or
>> perhaps using a queuing mechanism for governor and policy related changes.
>>

Here's what I think we should do. Taking a step back, the purpose of the
cpufreq sysfs files is to allow userspace to read current cpu frequency
settings, and to allow userspce to modify the governor and set the max & min
ranges for cpu frequencies. This can be done per device or for all cpus
depending on the driver.

We have to guarantee that bothing reading and writing will always work and that
write operations will always be atomic relative to userspace. The current
implementation of cpufreq does this through the following locks:

cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost
cpufreq_governor_lock: protects the current governor
cpufreq_governor_mutex: protects the cpufreq_governor_list
cpufreq_rwsem: protects the driver from being unloaded
global_kobj_lock: protects the "cpufreq" kobject
each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
each policy has a transition_lock (policy->transition): synchronizes
frequency transitions

While examining this code I was wondering exactly why we allow multiple readers
and writers in cpufreq. I could understand if we felt that this data was
critical; but it really isn't. A short delay here isn't that big of a deal IMO
(if someone can produce a case where a delay would cause a serious problem I'd
like to hear it). I don't even think it is safe in most cases to allow readers
while cpufreq values are changing; if we're changing the governor userspace
cannot rely on the value of (for example) cpuinfo_max_freq.

So I'm proposing that we move to a single threaded read/write using, if
possible, a single policy lock for now. We might transition this back to a
rwsem later on, however, for the first attempt at cleaning this up I think we
should just stick with a simple lock. In doing that, IMO we remove

cpufreq_rwsem: protects the driver from being unloaded
cpufreq_governor_lock: protects the current governor
each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct

and potentially

cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost

After looking at the way the code would be structured, I'm wondering if

cpufreq_governor_mutex: protects the cpufreq_governor_list

is overkill. The loading of a module should be atomic relative to the cpufreq
code, so this lock may not be required. (Admittedly I haven't tested that...)

That would leave:

global_kobj_lock: protects the "cpufreq" kobject
each policy has a transition_lock (policy->transition): synchronizes
frequency transitions

and a new lock, perhaps called policy->lock, to serialize all events.

Pros: We clean all this up to a simpler single threaded model. Bugs and races
here would be much easier to handle. We're currently putting band-aid on
band-aids in this code ATM and it looks like we're seeing old races expanded
or new races exposed.

Cons: We lose the ability to do simultaneous reads and writes ... although
I remain unconvinced that this would ever be safe to do. ie) If I change the
governor while at the same time reading, for example, the current cpu
frequency I cannot rely on that value to be valid.

After that we can add some reference counting to the sysfs file accesses
so that we can block after the sysfs removal when we change cpufreq
governors. I think that would be trivial and that it would resolve any races
when adding and removing governor's sysfs files.

P.

Subject: RE: Locking issues with cpufreq and sysfs



> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Prarit Bhargava
> Sent: Tuesday, 14 October, 2014 1:24 PM
> To: Viresh Kumar
> Cc: Saravana Kannan; Rafael J. Wysocki; [email protected]; Linux
> Kernel; Robert Schöne
> Subject: Re: Locking issues with cpufreq and sysfs
>
> On 10/14/2014 03:10 AM, Viresh Kumar wrote:
> > On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
> >>
> >> The locking is insufficient here, Viresh. I no longer believe that fixes
> >> to this locking scheme are the right way to move forward here. I'm
> wondering
> >> if we can look at other alternatives such as maintaining a refcount or
> >> perhaps using a queuing mechanism for governor and policy related changes.
> >>
...
> So I'm proposing that we move to a single threaded read/write using, if
> possible, a single policy lock for now. We might transition this back to a
> rwsem later on, however, for the first attempt at cleaning this up I think we
> should just stick with a simple lock. In doing that, IMO we remove
> cpufreq_rwsem: protects the driver from being unloaded
> cpufreq_governor_lock: protects the current governor
> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
>
> and potentially
> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver-
> >boost
>
> After looking at the way the code would be structured, I'm wondering if
> cpufreq_governor_mutex: protects the cpufreq_governor_list
> is overkill. The loading of a module should be atomic relative to the
> cpufreq code, so this lock may not be required. (Admittedly I haven't
> tested that...)
>
> That would leave:
> global_kobj_lock: protects the "cpufreq" kobject
> each policy has a transition_lock (policy->transition): synchronizes
> frequency transitions
> and a new lock, perhaps called policy->lock, to serialize all events.
>

Please keep performance in mind too. cpufreq_governor_lock
contention is a bit of an issue with heavy IO workloads
as described in:
http://marc.info/?l=linux-pm&m=140924051503827&w=2


---
Rob Elliott HP Server Storage



????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-10-16 11:23:54

by Viresh Kumar

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs

On 14 October 2014 23:54, Prarit Bhargava <[email protected]> wrote:
> Here's what I think we should do. Taking a step back, the purpose of the
> cpufreq sysfs files is to allow userspace to read current cpu frequency
> settings, and to allow userspce to modify the governor and set the max & min
> ranges for cpu frequencies. This can be done per device or for all cpus
> depending on the driver.

Okay.

> We have to guarantee that bothing reading and writing will always work and that
> write operations will always be atomic relative to userspace. The current

Ok.

> implementation of cpufreq does this through the following locks:
>
> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost
> cpufreq_governor_lock: protects the current governor

Its just for serialization..

> cpufreq_governor_mutex: protects the cpufreq_governor_list
> cpufreq_rwsem: protects the driver from being unloaded
> global_kobj_lock: protects the "cpufreq" kobject
> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
> each policy has a transition_lock (policy->transition): synchronizes
> frequency transitions
>
> While examining this code I was wondering exactly why we allow multiple readers
> and writers in cpufreq. I could understand if we felt that this data was
> critical; but it really isn't. A short delay here isn't that big of a deal IMO
> (if someone can produce a case where a delay would cause a serious problem I'd
> like to hear it). I don't even think it is safe in most cases to allow readers
> while cpufreq values are changing; if we're changing the governor userspace
> cannot rely on the value of (for example) cpuinfo_max_freq.

I don't know how reader writer lock will fail and a normal lock will not.
There is only benefit of rwlock, that readers can read things while
there is nobody
writing..

> So I'm proposing that we move to a single threaded read/write using, if

Okay, but how will that benefit us ?

> possible, a single policy lock for now. We might transition this back to a
> rwsem later on, however, for the first attempt at cleaning this up I think we
> should just stick with a simple lock. In doing that, IMO we remove
>
> cpufreq_rwsem: protects the driver from being unloaded
> cpufreq_governor_lock: protects the current governor
> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
>
> and potentially
>
> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost

Not really sure, but yeah we might be able to club few of them..

> After looking at the way the code would be structured, I'm wondering if
>
> cpufreq_governor_mutex: protects the cpufreq_governor_list
>
> is overkill. The loading of a module should be atomic relative to the cpufreq
> code, so this lock may not be required. (Admittedly I haven't tested that...)
>
> That would leave:
>
> global_kobj_lock: protects the "cpufreq" kobject
> each policy has a transition_lock (policy->transition): synchronizes
> frequency transitions
>
> and a new lock, perhaps called policy->lock, to serialize all events.
>
> Pros: We clean all this up to a simpler single threaded model. Bugs and races
> here would be much easier to handle. We're currently putting band-aid on
> band-aids in this code ATM and it looks like we're seeing old races expanded
> or new races exposed.
>
> Cons: We lose the ability to do simultaneous reads and writes ... although
> I remain unconvinced that this would ever be safe to do. ie) If I change the
> governor while at the same time reading, for example, the current cpu
> frequency I cannot rely on that value to be valid.
>
> After that we can add some reference counting to the sysfs file accesses
> so that we can block after the sysfs removal when we change cpufreq
> governors. I think that would be trivial and that it would resolve any races
> when adding and removing governor's sysfs files.

Not really sure, but if you solve few things with getting these bugs resolved
then we might apply your patches without any issues.

2014-10-17 11:38:32

by Viresh Kumar

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs

On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
> There are several issues with the current locking design of cpufreq. Most
> notably is the panic reported here:
>
> http://marc.info/?l=linux-kernel&m=140622451625236&w=2
>
> which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
> cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces

Don't know whats going wrong but I am just not able to reproduce the
lockdep again :(
I have tried this on two boards and am making sure that all things are correctly
configured. I am trying this on two of Exynos boards, One a dual-A15 and another
big little with 8 cores..


@@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct
cpufreq_policy *policy,
/* end old governor */
if (old_gov) {
__cpufreq_governor(policy, CPUFREQ_GOV_STOP);
- up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
}

/* start new governor */
@@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct
cpufreq_policy *policy,
if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
goto out;

- up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
}

diff --git a/drivers/cpufreq/cpufreq_governor.c
b/drivers/cpufreq/cpufreq_governor.c
index 1b44496..1a6972a 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -323,6 +323,7 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
return 0;
case CPUFREQ_GOV_POLICY_EXIT:
if (!--dbs_data->usage_count) {
+ pr_info("%s\n", __func__);
sysfs_remove_group(get_governor_parent_kobj(policy),
get_sysfs_attr(dbs_data));




.config attached too, please let me know what am I missing.


Attachments:
.config (72.87 kB)

2014-10-17 12:14:22

by Prarit Bhargava

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs



On 10/16/2014 07:23 AM, Viresh Kumar wrote:
> On 14 October 2014 23:54, Prarit Bhargava <[email protected]> wrote:
>> Here's what I think we should do. Taking a step back, the purpose of the
>> cpufreq sysfs files is to allow userspace to read current cpu frequency
>> settings, and to allow userspce to modify the governor and set the max & min
>> ranges for cpu frequencies. This can be done per device or for all cpus
>> depending on the driver.
>
> Okay.
>
>> We have to guarantee that bothing reading and writing will always work and that
>> write operations will always be atomic relative to userspace. The current
>
> Ok.
>
>> implementation of cpufreq does this through the following locks:
>>
>> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost
>> cpufreq_governor_lock: protects the current governor
>
> Its just for serialization..
>
>> cpufreq_governor_mutex: protects the cpufreq_governor_list
>> cpufreq_rwsem: protects the driver from being unloaded
>> global_kobj_lock: protects the "cpufreq" kobject
>> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
>> each policy has a transition_lock (policy->transition): synchronizes
>> frequency transitions
>>
>> While examining this code I was wondering exactly why we allow multiple readers
>> and writers in cpufreq. I could understand if we felt that this data was
>> critical; but it really isn't. A short delay here isn't that big of a deal IMO
>> (if someone can produce a case where a delay would cause a serious problem I'd
>> like to hear it). I don't even think it is safe in most cases to allow readers
>> while cpufreq values are changing; if we're changing the governor userspace
>> cannot rely on the value of (for example) cpuinfo_max_freq.
>
> I don't know how reader writer lock will fail and a normal lock will not.
> There is only benefit of rwlock, that readers can read things while
> there is nobody
> writing..
>
>> So I'm proposing that we move to a single threaded read/write using, if
>
> Okay, but how will that benefit us ?

It will greatly simplify the code. The locking isn't working in this code at
all right now and is causing various reported panics ... you yourself are
pushing a lock patch that serializes operations -- which is causing other
problems during testing.

>
>> possible, a single policy lock for now. We might transition this back to a
>> rwsem later on, however, for the first attempt at cleaning this up I think we
>> should just stick with a simple lock. In doing that, IMO we remove
>>
>> cpufreq_rwsem: protects the driver from being unloaded
>> cpufreq_governor_lock: protects the current governor
>> each policy has a rwsem (policy->rwsem): protects the cpufreq_policy struct
>>
>> and potentially
>>
>> cpufreq_driver_lock: protects the cpufreq_cpu_data array and cpufreq_driver->boost
>
> Not really sure, but yeah we might be able to club few of them..
>
>> After looking at the way the code would be structured, I'm wondering if
>>
>> cpufreq_governor_mutex: protects the cpufreq_governor_list
>>
>> is overkill. The loading of a module should be atomic relative to the cpufreq
>> code, so this lock may not be required. (Admittedly I haven't tested that...)
>>
>> That would leave:
>>
>> global_kobj_lock: protects the "cpufreq" kobject
>> each policy has a transition_lock (policy->transition): synchronizes
>> frequency transitions
>>
>> and a new lock, perhaps called policy->lock, to serialize all events.
>>
>> Pros: We clean all this up to a simpler single threaded model. Bugs and races
>> here would be much easier to handle. We're currently putting band-aid on
>> band-aids in this code ATM and it looks like we're seeing old races expanded
>> or new races exposed.
>>
>> Cons: We lose the ability to do simultaneous reads and writes ... although
>> I remain unconvinced that this would ever be safe to do. ie) If I change the
>> governor while at the same time reading, for example, the current cpu
>> frequency I cannot rely on that value to be valid.
>>
>> After that we can add some reference counting to the sysfs file accesses
>> so that we can block after the sysfs removal when we change cpufreq
>> governors. I think that would be trivial and that it would resolve any races
>> when adding and removing governor's sysfs files.
>
> Not really sure, but if you solve few things with getting these bugs resolved
> then we might apply your patches without any issues.
>

2014-10-17 12:15:44

by Prarit Bhargava

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs



On 10/17/2014 07:38 AM, Viresh Kumar wrote:
> On 13 October 2014 18:41, Prarit Bhargava <[email protected]> wrote:
>> There are several issues with the current locking design of cpufreq. Most
>> notably is the panic reported here:
>>
>> http://marc.info/?l=linux-kernel&m=140622451625236&w=2
>>
>> which was introduced by commit 955ef4833574636819cd269cfbae12f79cbde63a,
>> cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which introduces
>
> Don't know whats going wrong but I am just not able to reproduce the
> lockdep again :(
> I have tried this on two boards and am making sure that all things are correctly
> configured. I am trying this on two of Exynos boards, One a dual-A15 and another
> big little with 8 cores..
>
>
> @@ -2195,9 +2195,7 @@ static int cpufreq_set_policy(struct
> cpufreq_policy *policy,
> /* end old governor */
> if (old_gov) {
> __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* start new governor */
> @@ -2206,9 +2204,7 @@ static int cpufreq_set_policy(struct
> cpufreq_policy *policy,
> if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
> goto out;
>
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> diff --git a/drivers/cpufreq/cpufreq_governor.c
> b/drivers/cpufreq/cpufreq_governor.c
> index 1b44496..1a6972a 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -323,6 +323,7 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
> return 0;
> case CPUFREQ_GOV_POLICY_EXIT:
> if (!--dbs_data->usage_count) {
> + pr_info("%s\n", __func__);
> sysfs_remove_group(get_governor_parent_kobj(policy),
> get_sysfs_attr(dbs_data));
>
>

Hmmm

This is what I'm doing:

echo ondemand > scaling_governor
cat ondemand/*
echo conservative > scaling_governor

OOC what are you doing to test?

P.

>
>
> .config attached too, please let me know what am I missing.
>

2014-10-17 13:25:33

by Viresh Kumar

[permalink] [raw]
Subject: Re: Locking issues with cpufreq and sysfs

On 17 October 2014 17:45, Prarit Bhargava <[email protected]> wrote:
> Hmmm
>
> This is what I'm doing:
>
> echo ondemand > scaling_governor
> cat ondemand/*
> echo conservative > scaling_governor
>
> OOC what are you doing to test?

Exactly same and even tried Roberts script too :)