2014-12-16 23:09:55

by Dmitry Torokhov

[permalink] [raw]
Subject: [PATCH 0/4] Allow cpufreq-dt to defer probe if OPP table is not ready

[ Resending as my previous attempt appears to be messed up... ]

This series change cpufreq-dt driver to return -EPROBE_DEFER in case OPP table
is not ready by the time it gets to initialize. The use case is for platforms
that use platform code (or maybe a different driver altogether) to provide OPP
tables.

In addition to changes to cpufreq-dt there are assorted changes to OPP code to
make it work a bit better.

Note that I am not happy with OPP code: dev_pm_opp_init_cpufreq_table() is
wrong in it's assumption that taking RCU lock will guarantee that number of
OPPs will not change - we can remove OPP from list just fine, and then
initialization will fail. I also think that we shoudl change API so users should
get a reference to their OPP table and then pass opaque dev_opp pointer to
accessor APIs instead of re-scanning the global list over and over and over.

Thanks,
Dmitry

Dmitry Torokhov (4):
PM / OPP: add some lockdep annotations
PM / OPP: fix warning in of_free_opp_table
PM / OPP: take RCU lock in dev_pm_opp_get_opp_count
cpufreq-dt: defer probing if OPP table is not ready

drivers/base/power/opp.c | 39 +++++++++++++++++++++++++++++++--------
drivers/cpufreq/cpufreq-dt.c | 11 +++++++++++
2 files changed, 42 insertions(+), 8 deletions(-)


2014-12-16 23:09:58

by Dmitry Torokhov

[permalink] [raw]
Subject: [PATCH 2/4] PM / OPP: fix warning in of_free_opp_table

Not having OPP defined for a device is not a crime, we should not splat
warning in this case. Also, it seems that we are ready to accept invalid
dev (find_device_opp will return ERR_PTR(-EINVAL) then) so let's not
crash in dev_name() in such case.

Signed-off-by: Dmitry Torokhov <[email protected]>
---
drivers/base/power/opp.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index b78c14d..413c7fe 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -799,9 +799,15 @@ void of_free_opp_table(struct device *dev)

/* Check for existing list for 'dev' */
dev_opp = find_device_opp(dev);
- if (WARN(IS_ERR(dev_opp), "%s: dev_opp: %ld\n", dev_name(dev),
- PTR_ERR(dev_opp)))
+ if (IS_ERR(dev_opp)) {
+ int error = PTR_ERR(dev_opp);
+ if (error != -ENODEV)
+ WARN(1, "%s: dev_opp: %ld\n",
+ IS_ERR_OR_NULL(dev) ?
+ "Invalid device" : dev_name(dev),
+ error);
return;
+ }

/* Hold our list modification lock here */
mutex_lock(&dev_opp_list_lock);
--
2.2.0.rc0.207.ga3a616c

2014-12-16 23:09:57

by Dmitry Torokhov

[permalink] [raw]
Subject: [PATCH 1/4] PM / OPP: add some lockdep annotations

Certain OPP APIs need to be called under RCU lock; let's add a few
rcu_lockdep_assert() calls to warn about potential misuse.

Signed-off-by: Dmitry Torokhov <[email protected]>
---
drivers/base/power/opp.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index d24dd614a..b78c14d 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -108,6 +108,14 @@ static LIST_HEAD(dev_opp_list);
/* Lock to allow exclusive modification to the device and opp lists */
static DEFINE_MUTEX(dev_opp_list_lock);

+#define opp_rcu_lockdep_assert() \
+do { \
+ rcu_lockdep_assert(rcu_read_lock_held() || \
+ lockdep_is_held(&dev_opp_list_lock), \
+ "Missing rcu_read_lock() or " \
+ "dev_opp_list_lock protection"); \
+} while (0)
+
/**
* find_device_opp() - find device_opp struct using device pointer
* @dev: device pointer used to lookup device OPPs
@@ -218,6 +226,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
struct dev_pm_opp *temp_opp;
int count = 0;

+ opp_rcu_lockdep_assert();
+
dev_opp = find_device_opp(dev);
if (IS_ERR(dev_opp)) {
int r = PTR_ERR(dev_opp);
@@ -267,6 +277,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_exact(struct device *dev,
struct device_opp *dev_opp;
struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);

+ opp_rcu_lockdep_assert();
+
dev_opp = find_device_opp(dev);
if (IS_ERR(dev_opp)) {
int r = PTR_ERR(dev_opp);
@@ -313,6 +325,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_ceil(struct device *dev,
struct device_opp *dev_opp;
struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);

+ opp_rcu_lockdep_assert();
+
if (!dev || !freq) {
dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
return ERR_PTR(-EINVAL);
@@ -361,6 +375,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_floor(struct device *dev,
struct device_opp *dev_opp;
struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);

+ opp_rcu_lockdep_assert();
+
if (!dev || !freq) {
dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
return ERR_PTR(-EINVAL);
--
2.2.0.rc0.207.ga3a616c

2014-12-16 23:10:33

by Dmitry Torokhov

[permalink] [raw]
Subject: [PATCH 4/4] cpufreq-dt: defer probing if OPP table is not ready

cpufreq-dt driver supports mode when OPP table is provided by platform
code and not device tree. However on certain platforms code that fills
OPP table may run after cpufreq driver tries to initialize, so let's
report -EPROBE_DEFER if we do not find any entires in OPP table for the
CPU.

Signed-off-by: Dmitry Torokhov <[email protected]>
---
drivers/cpufreq/cpufreq-dt.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index f56147a..fde97d6 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -211,6 +211,17 @@ static int cpufreq_init(struct cpufreq_policy *policy)
/* OPPs might be populated at runtime, don't check for error here */
of_init_opp_table(cpu_dev);

+ /*
+ * But we need OPP table to function so if it is not there let's
+ * give platform code chance to provide it for us.
+ */
+ ret = dev_pm_opp_get_opp_count(cpu_dev);
+ if (ret <= 0) {
+ pr_debug("OPP table is not ready, deferring probe\n");
+ ret = -EPROBE_DEFER;
+ goto out_free_opp;
+ }
+
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
ret = -ENOMEM;
--
2.2.0.rc0.207.ga3a616c

2014-12-16 23:10:54

by Dmitry Torokhov

[permalink] [raw]
Subject: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

A lot of callers are missing the fact that dev_pm_opp_get_opp_count
needs to be called under RCU lock. Given that RCU locks can safely be
nested, instead of providing *_locked() API, let's take RCU lock inside
dev_pm_opp_get_opp_count() and leave callers as is.

Signed-off-by: Dmitry Torokhov <[email protected]>
---
drivers/base/power/opp.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 413c7fe..ee5eca2 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
* This function returns the number of available opps if there are any,
* else returns 0 if none or the corresponding error value.
*
- * Locking: This function must be called under rcu_read_lock(). This function
- * internally references two RCU protected structures: device_opp and opp which
- * are safe as long as we are under a common RCU locked section.
+ * Locking: This function takes rcu_read_lock().
*/
int dev_pm_opp_get_opp_count(struct device *dev)
{
@@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
struct dev_pm_opp *temp_opp;
int count = 0;

- opp_rcu_lockdep_assert();
+ rcu_read_lock();

dev_opp = find_device_opp(dev);
if (IS_ERR(dev_opp)) {
- int r = PTR_ERR(dev_opp);
- dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
- return r;
+ count = PTR_ERR(dev_opp);
+ dev_err(dev, "%s: device OPP not found (%d)\n",
+ __func__, count);
+ goto out_unlock;
}

list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
@@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
count++;
}

+out_unlock:
+ rcu_read_unlock();
return count;
}
EXPORT_SYMBOL_GPL(dev_pm_opp_get_opp_count);
--
2.2.0.rc0.207.ga3a616c

2014-12-17 04:10:44

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 1/4] PM / OPP: add some lockdep annotations

On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> Certain OPP APIs need to be called under RCU lock; let's add a few
> rcu_lockdep_assert() calls to warn about potential misuse.
>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/base/power/opp.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index d24dd614a..b78c14d 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -108,6 +108,14 @@ static LIST_HEAD(dev_opp_list);
> /* Lock to allow exclusive modification to the device and opp lists */
> static DEFINE_MUTEX(dev_opp_list_lock);
>
> +#define opp_rcu_lockdep_assert() \
> +do { \
> + rcu_lockdep_assert(rcu_read_lock_held() || \
> + lockdep_is_held(&dev_opp_list_lock), \
> + "Missing rcu_read_lock() or " \
> + "dev_opp_list_lock protection"); \
> +} while (0)
> +
> /**
> * find_device_opp() - find device_opp struct using device pointer
> * @dev: device pointer used to lookup device OPPs
> @@ -218,6 +226,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> struct dev_pm_opp *temp_opp;
> int count = 0;
>
> + opp_rcu_lockdep_assert();
> +
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> int r = PTR_ERR(dev_opp);
> @@ -267,6 +277,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_exact(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> int r = PTR_ERR(dev_opp);
> @@ -313,6 +325,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_ceil(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> if (!dev || !freq) {
> dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
> return ERR_PTR(-EINVAL);
> @@ -361,6 +375,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_floor(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> if (!dev || !freq) {
> dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
> return ERR_PTR(-EINVAL);

Acked-by: Viresh Kumar <[email protected]>

2014-12-17 04:28:20

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 2/4] PM / OPP: fix warning in of_free_opp_table

On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> Not having OPP defined for a device is not a crime, we should not splat
> warning in this case. Also, it seems that we are ready to accept invalid
> dev (find_device_opp will return ERR_PTR(-EINVAL) then) so let's not
> crash in dev_name() in such case.
>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/base/power/opp.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index b78c14d..413c7fe 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -799,9 +799,15 @@ void of_free_opp_table(struct device *dev)
>
> /* Check for existing list for 'dev' */
> dev_opp = find_device_opp(dev);
> - if (WARN(IS_ERR(dev_opp), "%s: dev_opp: %ld\n", dev_name(dev),
> - PTR_ERR(dev_opp)))
> + if (IS_ERR(dev_opp)) {
> + int error = PTR_ERR(dev_opp);
> + if (error != -ENODEV)
> + WARN(1, "%s: dev_opp: %ld\n",
> + IS_ERR_OR_NULL(dev) ?
> + "Invalid device" : dev_name(dev),
> + error);
> return;

What about this:

if (IS_ERR(dev_opp)) {
int error = PTR_ERR(dev_opp);
WARN(error != -ENODEV, "%s: dev_opp: %ld\n",
IS_ERR_OR_NULL(dev) ? "Invalid device" : dev_name(dev),
error);
return;
}

We can get rid of the extra indentation level and an extra comparison check.

Otherwise:

Acked-by: Viresh Kumar <[email protected]>

2014-12-17 04:36:21

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> needs to be called under RCU lock. Given that RCU locks can safely be
> nested, instead of providing *_locked() API, let's take RCU lock inside

Hmm, I asked for a *_locked() API because many users of
dev_pm_opp_get_opp_count() are already calling it from rcu read side
critical sections.

Now, there are two questions:
- Can rcu-read side critical sections be nested ?

Yes, this is what the comment over rcu_read_lock() says

* RCU read-side critical sections may be nested. Any deferred actions
* will be deferred until the outermost RCU read-side critical section
* completes.

- Would it be better to drop these double rcu_read_locks() ? i.e. either
get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().

@Paul: What do you say ?

> dev_pm_opp_get_opp_count() and leave callers as is.
>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/base/power/opp.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index 413c7fe..ee5eca2 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
> * This function returns the number of available opps if there are any,
> * else returns 0 if none or the corresponding error value.
> *
> - * Locking: This function must be called under rcu_read_lock(). This function
> - * internally references two RCU protected structures: device_opp and opp which
> - * are safe as long as we are under a common RCU locked section.
> + * Locking: This function takes rcu_read_lock().
> */
> int dev_pm_opp_get_opp_count(struct device *dev)
> {
> @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> struct dev_pm_opp *temp_opp;
> int count = 0;
>
> - opp_rcu_lockdep_assert();
> + rcu_read_lock();
>
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> - int r = PTR_ERR(dev_opp);
> - dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> - return r;
> + count = PTR_ERR(dev_opp);
> + dev_err(dev, "%s: device OPP not found (%d)\n",
> + __func__, count);
> + goto out_unlock;
> }
>
> list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> count++;
> }
>
> +out_unlock:
> + rcu_read_unlock();
> return count;
> }

Looked fine otherwise:

Acked-by: Viresh Kumar <[email protected]>

2014-12-17 04:37:04

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 4/4] cpufreq-dt: defer probing if OPP table is not ready

On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> cpufreq-dt driver supports mode when OPP table is provided by platform
> code and not device tree. However on certain platforms code that fills
> OPP table may run after cpufreq driver tries to initialize, so let's
> report -EPROBE_DEFER if we do not find any entires in OPP table for the
> CPU.
>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/cpufreq/cpufreq-dt.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
> index f56147a..fde97d6 100644
> --- a/drivers/cpufreq/cpufreq-dt.c
> +++ b/drivers/cpufreq/cpufreq-dt.c
> @@ -211,6 +211,17 @@ static int cpufreq_init(struct cpufreq_policy *policy)
> /* OPPs might be populated at runtime, don't check for error here */
> of_init_opp_table(cpu_dev);
>
> + /*
> + * But we need OPP table to function so if it is not there let's
> + * give platform code chance to provide it for us.
> + */
> + ret = dev_pm_opp_get_opp_count(cpu_dev);
> + if (ret <= 0) {
> + pr_debug("OPP table is not ready, deferring probe\n");
> + ret = -EPROBE_DEFER;
> + goto out_free_opp;
> + }
> +
> priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> if (!priv) {
> ret = -ENOMEM;

Acked-by: Viresh Kumar <[email protected]>

2014-12-17 04:42:29

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 0/4] Allow cpufreq-dt to defer probe if OPP table is not ready

On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> Note that I am not happy with OPP code: dev_pm_opp_init_cpufreq_table() is
> wrong in it's assumption that taking RCU lock will guarantee that number of
> OPPs will not change - we can remove OPP from list just fine, and then
> initialization will fail. I also think that we shoudl change API so users should
> get a reference to their OPP table and then pass opaque dev_opp pointer to
> accessor APIs instead of re-scanning the global list over and over and over.

Yeah, agree for both of them..

2014-12-17 17:28:46

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wednesday, December 17, 2014 10:06:17 AM Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> > A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> > needs to be called under RCU lock. Given that RCU locks can safely be
> > nested, instead of providing *_locked() API, let's take RCU lock inside
>
> Hmm, I asked for a *_locked() API because many users of
> dev_pm_opp_get_opp_count() are already calling it from rcu read side
> critical sections.
>
> Now, there are two questions:
> - Can rcu-read side critical sections be nested ?
>
> Yes, this is what the comment over rcu_read_lock() says
>
> * RCU read-side critical sections may be nested. Any deferred actions
> * will be deferred until the outermost RCU read-side critical section
> * completes.
>
> - Would it be better to drop these double rcu_read_locks() ? i.e. either
> get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().
>
> @Paul: What do you say ?
>

FWIW the change is a stop-gap; I hope we'll get away from using
dev_pm_opp_get_opp_count() in cpufreq drivers and then we can revert the
change. I just did not want to touch cpufreq drivers unless necessary.

Thanks,
Dmitry

2014-12-17 23:47:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wed, Dec 17, 2014 at 10:06:17AM +0530, Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
> > A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> > needs to be called under RCU lock. Given that RCU locks can safely be
> > nested, instead of providing *_locked() API, let's take RCU lock inside
>
> Hmm, I asked for a *_locked() API because many users of
> dev_pm_opp_get_opp_count() are already calling it from rcu read side
> critical sections.
>
> Now, there are two questions:
> - Can rcu-read side critical sections be nested ?
>
> Yes, this is what the comment over rcu_read_lock() says
>
> * RCU read-side critical sections may be nested. Any deferred actions
> * will be deferred until the outermost RCU read-side critical section
> * completes.
>
> - Would it be better to drop these double rcu_read_locks() ? i.e. either
> get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().
>
> @Paul: What do you say ?

Yep, they can be nested. Both rcu_read_lock() and rcu_read_unlock()
are quite fast, as are their friends, so there is almost no performance
penalty from nesting. So the decision normally turns on maintainability
and style.

Thanx, Paul

> > dev_pm_opp_get_opp_count() and leave callers as is.
> >
> > Signed-off-by: Dmitry Torokhov <[email protected]>
> > ---
> > drivers/base/power/opp.c | 15 ++++++++-------
> > 1 file changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> > index 413c7fe..ee5eca2 100644
> > --- a/drivers/base/power/opp.c
> > +++ b/drivers/base/power/opp.c
> > @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
> > * This function returns the number of available opps if there are any,
> > * else returns 0 if none or the corresponding error value.
> > *
> > - * Locking: This function must be called under rcu_read_lock(). This function
> > - * internally references two RCU protected structures: device_opp and opp which
> > - * are safe as long as we are under a common RCU locked section.
> > + * Locking: This function takes rcu_read_lock().
> > */
> > int dev_pm_opp_get_opp_count(struct device *dev)
> > {
> > @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> > struct dev_pm_opp *temp_opp;
> > int count = 0;
> >
> > - opp_rcu_lockdep_assert();
> > + rcu_read_lock();
> >
> > dev_opp = find_device_opp(dev);
> > if (IS_ERR(dev_opp)) {
> > - int r = PTR_ERR(dev_opp);
> > - dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> > - return r;
> > + count = PTR_ERR(dev_opp);
> > + dev_err(dev, "%s: device OPP not found (%d)\n",
> > + __func__, count);
> > + goto out_unlock;
> > }
> >
> > list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> > @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> > count++;
> > }
> >
> > +out_unlock:
> > + rcu_read_unlock();
> > return count;
> > }
>
> Looked fine otherwise:
>
> Acked-by: Viresh Kumar <[email protected]>
>

2014-12-18 02:11:17

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 18 December 2014 at 05:17, Paul E. McKenney
<[email protected]> wrote:
> Yep, they can be nested. Both rcu_read_lock() and rcu_read_unlock()
> are quite fast, as are their friends, so there is almost no performance
> penalty from nesting. So the decision normally turns on maintainability
> and style.

Thanks again for your kind advice :)

2014-12-18 02:12:01

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 17 December 2014 at 22:58, Dmitry Torokhov <[email protected]> wrote:
> FWIW the change is a stop-gap; I hope we'll get away from using
> dev_pm_opp_get_opp_count() in cpufreq drivers and then we can revert the
> change. I just did not want to touch cpufreq drivers unless necessary.

Yeah, all good now. No more questions :)

2014-12-24 16:29:03

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 1/4] PM / OPP: add some lockdep annotations

On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
> Certain OPP APIs need to be called under RCU lock; let's add a few
> rcu_lockdep_assert() calls to warn about potential misuse.
>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/base/power/opp.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index d24dd614a..b78c14d 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -108,6 +108,14 @@ static LIST_HEAD(dev_opp_list);
> /* Lock to allow exclusive modification to the device and opp lists */
> static DEFINE_MUTEX(dev_opp_list_lock);
>
> +#define opp_rcu_lockdep_assert() \
> +do { \
> + rcu_lockdep_assert(rcu_read_lock_held() || \
> + lockdep_is_held(&dev_opp_list_lock), \
> + "Missing rcu_read_lock() or " \
> + "dev_opp_list_lock protection"); \
> +} while (0)
> +
> /**
> * find_device_opp() - find device_opp struct using device pointer
> * @dev: device pointer used to lookup device OPPs
> @@ -218,6 +226,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> struct dev_pm_opp *temp_opp;
> int count = 0;
>
> + opp_rcu_lockdep_assert();
> +
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> int r = PTR_ERR(dev_opp);
> @@ -267,6 +277,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_exact(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> int r = PTR_ERR(dev_opp);
> @@ -313,6 +325,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_ceil(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> if (!dev || !freq) {
> dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
> return ERR_PTR(-EINVAL);
> @@ -361,6 +375,8 @@ struct dev_pm_opp *dev_pm_opp_find_freq_floor(struct device *dev,
> struct device_opp *dev_opp;
> struct dev_pm_opp *temp_opp, *opp = ERR_PTR(-ERANGE);
>
> + opp_rcu_lockdep_assert();
> +
> if (!dev || !freq) {
> dev_err(dev, "%s: Invalid argument freq=%p\n", __func__, freq);
> return ERR_PTR(-EINVAL);
>

You should also add opp_rcu_lockdep_assert to the following functions:
dev_pm_opp_get_voltage and dev_pm_opp_get_freq - both of which must be
used under rcu read lock.

dev_pm_opp_get_notifier references the RCU protected dev_opp list ->
so that must also be under rcu read lock.


love the concept, and I suggest splitting this into the following:

RCU readers: trivial as rcu_read_lock_help dumps
RCU updates: (adding, updating, removing): for the helper functions
ensure mutex_lock held assertion.



--
Regards,
Nishanth Menon

2014-12-24 16:42:39

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 2/4] PM / OPP: fix warning in of_free_opp_table

$subject:
PM / OPP: Do not warn when no OPP was ever registered in of_free_opp_table

might be more appropriate?

On 12/16/2014 10:28 PM, Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
>> Not having OPP defined for a device is not a crime, we should not splat
>> warning in this case. Also, it seems that we are ready to accept invalid
>> dev (find_device_opp will return ERR_PTR(-EINVAL) then) so let's not
>> crash in dev_name() in such case.
>>
>> Signed-off-by: Dmitry Torokhov <[email protected]>
>> ---
>> drivers/base/power/opp.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
>> index b78c14d..413c7fe 100644
>> --- a/drivers/base/power/opp.c
>> +++ b/drivers/base/power/opp.c
>> @@ -799,9 +799,15 @@ void of_free_opp_table(struct device *dev)
>>
>> /* Check for existing list for 'dev' */
>> dev_opp = find_device_opp(dev);
>> - if (WARN(IS_ERR(dev_opp), "%s: dev_opp: %ld\n", dev_name(dev),
>> - PTR_ERR(dev_opp)))
>> + if (IS_ERR(dev_opp)) {
>> + int error = PTR_ERR(dev_opp);
>> + if (error != -ENODEV)
>> + WARN(1, "%s: dev_opp: %ld\n",
>> + IS_ERR_OR_NULL(dev) ?
>> + "Invalid device" : dev_name(dev),
>> + error);
>> return;
>
> What about this:
>
> if (IS_ERR(dev_opp)) {
> int error = PTR_ERR(dev_opp);
> WARN(error != -ENODEV, "%s: dev_opp: %ld\n",
> IS_ERR_OR_NULL(dev) ? "Invalid device" : dev_name(dev),
> error);
> return;
> }
Adds the following build warning: I suggest changing the %ld to %d

"warning: format ‘%ld’ expects argument of type ‘long int’, but
argument 5 has type ‘int’ [-Wformat]"


This also fixes the warning I got on AM437x based platforms. test results:
1: am437x-sk: BOOT: PASS: crit=2 err=13 warn=25, CPUFreq: N/A,
CPUIdle: N/A: http://slexy.org/raw/s20MAtNEbt
2: am437x-sk-before: BOOT: PASS: crit=2 err=13 warn=57, CPUFreq: N/A,
CPUIdle: N/A: http://slexy.org/raw/s21rZCo4cD

With the suggested changes,

Acked-by: Nishanth Menon <[email protected]>

>
> We can get rid of the extra indentation level and an extra comparison check.
>
> Otherwise:
>
> Acked-by: Viresh Kumar <[email protected]>


PS: I am sure you have already done some level of checks, but I
suggest using something like aiaiai or
https://github.com/nmenon/kernel_patch_verify or other solutions like
those discussed in
https://plus.google.com/112464029509057661457/posts/5LTuHHMyXLT to
help do some initial clean up of the patches..


--
Regards,
Nishanth Menon

2014-12-24 16:48:34

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> needs to be called under RCU lock. Given that RCU locks can safely be
> nested, instead of providing *_locked() API, let's take RCU lock inside
> dev_pm_opp_get_opp_count() and leave callers as is.

While it is true that we can safely do nested RCU locks, This also
encourages wrong usage.

count = dev_pm_opp_get_opp_count(dev)
^^ point A
array = kzalloc(count * sizeof (*array));
rcu_read_lock();
^^ point B
.. work down the list and add OPPs..
...

Between A and B, we might have had list modification (dynamic OPP
addition or deletion) - which implies that the count is no longer
accurate between point A and B. instead, enforcing callers to have the
responsibility of rcu_lock is exactly what we have to do since the OPP
library has no clue how to enforce pointer or data accuracy.

Sorry, NAK. this setsup stage for further similar additions to
get_voltage, freq etc.. basically encourages errorneous usage of the
functions and misinterpretation of data procured.

>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
> drivers/base/power/opp.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index 413c7fe..ee5eca2 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
> * This function returns the number of available opps if there are any,
> * else returns 0 if none or the corresponding error value.
> *
> - * Locking: This function must be called under rcu_read_lock(). This function
> - * internally references two RCU protected structures: device_opp and opp which
> - * are safe as long as we are under a common RCU locked section.
> + * Locking: This function takes rcu_read_lock().
> */
> int dev_pm_opp_get_opp_count(struct device *dev)
> {
> @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> struct dev_pm_opp *temp_opp;
> int count = 0;
>
> - opp_rcu_lockdep_assert();
> + rcu_read_lock();
>
> dev_opp = find_device_opp(dev);
> if (IS_ERR(dev_opp)) {
> - int r = PTR_ERR(dev_opp);
> - dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> - return r;
> + count = PTR_ERR(dev_opp);
> + dev_err(dev, "%s: device OPP not found (%d)\n",
> + __func__, count);
> + goto out_unlock;
> }
>
> list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> count++;
> }
>
> +out_unlock:
> + rcu_read_unlock();
> return count;
> }
> EXPORT_SYMBOL_GPL(dev_pm_opp_get_opp_count);
>
--
Regards,
Nishanth Menon

2014-12-24 16:58:39

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 4/4] cpufreq-dt: defer probing if OPP table is not ready

On 12/16/2014 10:37 PM, Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <[email protected]> wrote:
>> cpufreq-dt driver supports mode when OPP table is provided by platform
>> code and not device tree. However on certain platforms code that fills
>> OPP table may run after cpufreq driver tries to initialize, so let's
>> report -EPROBE_DEFER if we do not find any entires in OPP table for the
>> CPU.
>>
>> Signed-off-by: Dmitry Torokhov <[email protected]>
>> ---
>> drivers/cpufreq/cpufreq-dt.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
>> index f56147a..fde97d6 100644
>> --- a/drivers/cpufreq/cpufreq-dt.c
>> +++ b/drivers/cpufreq/cpufreq-dt.c
>> @@ -211,6 +211,17 @@ static int cpufreq_init(struct cpufreq_policy *policy)
>> /* OPPs might be populated at runtime, don't check for error here */
>> of_init_opp_table(cpu_dev);
>>
>> + /*
>> + * But we need OPP table to function so if it is not there let's
>> + * give platform code chance to provide it for us.
>> + */
>> + ret = dev_pm_opp_get_opp_count(cpu_dev);
>> + if (ret <= 0) {
>> + pr_debug("OPP table is not ready, deferring probe\n");
>> + ret = -EPROBE_DEFER;
>> + goto out_free_opp;
Umm.. why would I be here? because of_init_opp_table(cpu_dev) did not
find any OPPs at all. but that is OK, coz, we expect dynamic addition
of OPPs by someone else... now,

case 1:
if I did have few OPPs in DT (of_init_opp_table) and cpufreq_init gets
called *before* my dynamic OPPs are added(example based on efuse
entries) -> I'd already have built my dev_pm_opp_init_cpufreq_table
and that wont be accurate. but at that point dev_pm_opp_get_opp_count
cant fail either.. and if cpufreq_init was called *after* dynamic OPPs
are added - correct table, AND dev_pm_opp_get_opp_count wont fail either.

case 2:
If I had 0 OPPs in DT (of_init_opp_table fails) and cpufreq_init gets
called *before* my dynamic OPPs are added(example based on efuse
entries) -> doing a goto out_free_opp and of_free_opp_table is of no
use at all..

case 3:
If I had 0 OPPs in DT (of_init_opp_table fails) and cpufreq_init gets
called *after* my dynamic OPPs are added(example based on efuse
entries) -> I wont have error dev_pm_opp_get_opp_count.

So, you'd better want to introduce an intermediate goto for of_node_put

Also, please use dev_dbg instead of pr_*, since you already have
cpu_dev pointer.

>> + }
>> +
>> priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> if (!priv) {
>> ret = -ENOMEM;
>
> Acked-by: Viresh Kumar <[email protected]>
>


--
Regards,
Nishanth Menon

2014-12-24 17:09:07

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>> needs to be called under RCU lock. Given that RCU locks can safely be
>> nested, instead of providing *_locked() API, let's take RCU lock inside
>> dev_pm_opp_get_opp_count() and leave callers as is.
>
> While it is true that we can safely do nested RCU locks, This also
> encourages wrong usage.
>
> count = dev_pm_opp_get_opp_count(dev)
> ^^ point A
> array = kzalloc(count * sizeof (*array));
> rcu_read_lock();
> ^^ point B
> .. work down the list and add OPPs..
> ...
>
> Between A and B, we might have had list modification (dynamic OPP
> addition or deletion) - which implies that the count is no longer
> accurate between point A and B. instead, enforcing callers to have the
> responsibility of rcu_lock is exactly what we have to do since the OPP
> library has no clue how to enforce pointer or data accuracy.

No, you seem to have a misconception that rcu_lock protects you past
the point B, but that is also wrong. The only thing rcu "lock"
provides is safe traversing the list and guarantee that elements will
not disappear while you are referencing them, but list can both
contract and expand under you. In that regard code in
drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
the list and use number of elements you should be taking a mutex.
Luckily all cpufreq drivers at the moment only want to see if OPP
table is empty or not, so as a stop-gap we can take rcu_lock
automatically as we are getting count. We won't get necessarily
accurate result, but at least we will be safe traversing the list.

Thanks.

--
Dmitry

2014-12-24 17:16:53

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>
>> While it is true that we can safely do nested RCU locks, This also
>> encourages wrong usage.
>>
>> count = dev_pm_opp_get_opp_count(dev)
>> ^^ point A
>> array = kzalloc(count * sizeof (*array));
>> rcu_read_lock();
>> ^^ point B
>> .. work down the list and add OPPs..
>> ...
>>
>> Between A and B, we might have had list modification (dynamic OPP
>> addition or deletion) - which implies that the count is no longer
>> accurate between point A and B. instead, enforcing callers to have the
>> responsibility of rcu_lock is exactly what we have to do since the OPP
>> library has no clue how to enforce pointer or data accuracy.
>
> No, you seem to have a misconception that rcu_lock protects you past
> the point B, but that is also wrong. The only thing rcu "lock"
> provides is safe traversing the list and guarantee that elements will
> not disappear while you are referencing them, but list can both
> contract and expand under you. In that regard code in
> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
> the list and use number of elements you should be taking a mutex.
> Luckily all cpufreq drivers at the moment only want to see if OPP
> table is empty or not, so as a stop-gap we can take rcu_lock
> automatically as we are getting count. We won't get necessarily
> accurate result, but at least we will be safe traversing the list.

So, instead of a half solution, lets consider this in the realm of
dynamic OPPs as well. agreed to the point that we only have safe
traversal and pointer validity. the real problem however is with
"dynamic OPPs" (one of the original reasons why i did not add dynamic
OPPs in the original version was to escape from it's complexity for
users - anyways.. we are beyond that now). if OPPs can be removed on
the fly, we need the following:
a) use OPP notifiers to adequately handle list modification
b) lock down list modification (and associated APIs) to ensure that
the original cpufreq /devfreq list is correct.

I still dont see the need to do this half solution.


--
Regards,
Nishanth Menon

2014-12-24 17:31:47

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <[email protected]> wrote:
> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>
>>> While it is true that we can safely do nested RCU locks, This also
>>> encourages wrong usage.
>>>
>>> count = dev_pm_opp_get_opp_count(dev)
>>> ^^ point A
>>> array = kzalloc(count * sizeof (*array));
>>> rcu_read_lock();
>>> ^^ point B
>>> .. work down the list and add OPPs..
>>> ...
>>>
>>> Between A and B, we might have had list modification (dynamic OPP
>>> addition or deletion) - which implies that the count is no longer
>>> accurate between point A and B. instead, enforcing callers to have the
>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>> library has no clue how to enforce pointer or data accuracy.
>>
>> No, you seem to have a misconception that rcu_lock protects you past
>> the point B, but that is also wrong. The only thing rcu "lock"
>> provides is safe traversing the list and guarantee that elements will
>> not disappear while you are referencing them, but list can both
>> contract and expand under you. In that regard code in
>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>> the list and use number of elements you should be taking a mutex.
>> Luckily all cpufreq drivers at the moment only want to see if OPP
>> table is empty or not, so as a stop-gap we can take rcu_lock
>> automatically as we are getting count. We won't get necessarily
>> accurate result, but at least we will be safe traversing the list.
>
> So, instead of a half solution, lets consider this in the realm of
> dynamic OPPs as well. agreed to the point that we only have safe
> traversal and pointer validity. the real problem however is with
> "dynamic OPPs" (one of the original reasons why i did not add dynamic
> OPPs in the original version was to escape from it's complexity for
> users - anyways.. we are beyond that now). if OPPs can be removed on
> the fly, we need the following:
> a) use OPP notifiers to adequately handle list modification
> b) lock down list modification (and associated APIs) to ensure that
> the original cpufreq /devfreq list is correct.
>
> I still dont see the need to do this half solution.

The need for half solution at the moment is that you can't safely
travel the lists and may crash on an invalid pointer.

Going forward I think (I mentioned that in my other email) that we
should rework the OPP API so that callers fetch OPP table object for a
device at init/probe time and then use it to get OPPs. This way won't
have to travel two lists any time we want to reference an OPP.

And instead of relying notifiers, maybe look into using OPP tables
directly in cpufreq drivers instead of converting OPP into static-ish
cpufreq tables.

Thanks.

--
Dmitry

2014-12-24 17:37:21

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <[email protected]> wrote:
>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>
>>>> While it is true that we can safely do nested RCU locks, This also
>>>> encourages wrong usage.
>>>>
>>>> count = dev_pm_opp_get_opp_count(dev)
>>>> ^^ point A
>>>> array = kzalloc(count * sizeof (*array));
>>>> rcu_read_lock();
>>>> ^^ point B
>>>> .. work down the list and add OPPs..
>>>> ...
>>>>
>>>> Between A and B, we might have had list modification (dynamic OPP
>>>> addition or deletion) - which implies that the count is no longer
>>>> accurate between point A and B. instead, enforcing callers to have the
>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>> library has no clue how to enforce pointer or data accuracy.
>>>
>>> No, you seem to have a misconception that rcu_lock protects you past
>>> the point B, but that is also wrong. The only thing rcu "lock"
>>> provides is safe traversing the list and guarantee that elements will
>>> not disappear while you are referencing them, but list can both
>>> contract and expand under you. In that regard code in
>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>> the list and use number of elements you should be taking a mutex.
>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>> automatically as we are getting count. We won't get necessarily
>>> accurate result, but at least we will be safe traversing the list.
>>
>> So, instead of a half solution, lets consider this in the realm of
>> dynamic OPPs as well. agreed to the point that we only have safe
>> traversal and pointer validity. the real problem however is with
>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>> OPPs in the original version was to escape from it's complexity for
>> users - anyways.. we are beyond that now). if OPPs can be removed on
>> the fly, we need the following:
>> a) use OPP notifiers to adequately handle list modification
>> b) lock down list modification (and associated APIs) to ensure that
>> the original cpufreq /devfreq list is correct.
>>
>> I still dont see the need to do this half solution.
>
> The need for half solution at the moment is that you can't safely
> travel the lists and may crash on an invalid pointer.

So, fix the cpufreq-dt instead of moving the hack inside OPP driver.

>
> Going forward I think (I mentioned that in my other email) that we
> should rework the OPP API so that callers fetch OPP table object for a
> device at init/probe time and then use it to get OPPs. This way won't
> have to travel two lists any time we want to reference an OPP.
>
> And instead of relying notifiers, maybe look into using OPP tables
> directly in cpufreq drivers instead of converting OPP into static-ish
> cpufreq tables.
>

If you'd like a proper fix for OPP usage, I am all open to see such a
proposal that works not just for cpufreq, but also for devfreq as well.

--
Regards,
Nishanth Menon

2014-12-24 17:44:49

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <[email protected]> wrote:
> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <[email protected]> wrote:
>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>>
>>>>> While it is true that we can safely do nested RCU locks, This also
>>>>> encourages wrong usage.
>>>>>
>>>>> count = dev_pm_opp_get_opp_count(dev)
>>>>> ^^ point A
>>>>> array = kzalloc(count * sizeof (*array));
>>>>> rcu_read_lock();
>>>>> ^^ point B
>>>>> .. work down the list and add OPPs..
>>>>> ...
>>>>>
>>>>> Between A and B, we might have had list modification (dynamic OPP
>>>>> addition or deletion) - which implies that the count is no longer
>>>>> accurate between point A and B. instead, enforcing callers to have the
>>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>>> library has no clue how to enforce pointer or data accuracy.
>>>>
>>>> No, you seem to have a misconception that rcu_lock protects you past
>>>> the point B, but that is also wrong. The only thing rcu "lock"
>>>> provides is safe traversing the list and guarantee that elements will
>>>> not disappear while you are referencing them, but list can both
>>>> contract and expand under you. In that regard code in
>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>>> the list and use number of elements you should be taking a mutex.
>>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>>> automatically as we are getting count. We won't get necessarily
>>>> accurate result, but at least we will be safe traversing the list.
>>>
>>> So, instead of a half solution, lets consider this in the realm of
>>> dynamic OPPs as well. agreed to the point that we only have safe
>>> traversal and pointer validity. the real problem however is with
>>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>>> OPPs in the original version was to escape from it's complexity for
>>> users - anyways.. we are beyond that now). if OPPs can be removed on
>>> the fly, we need the following:
>>> a) use OPP notifiers to adequately handle list modification
>>> b) lock down list modification (and associated APIs) to ensure that
>>> the original cpufreq /devfreq list is correct.
>>>
>>> I still dont see the need to do this half solution.
>>
>> The need for half solution at the moment is that you can't safely
>> travel the lists and may crash on an invalid pointer.
>
> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.

I started there, but it is not only cpufreq-dt that got it wrong. I
considered changing individual drivers (Viresh also suggested adding
_locked() variant API), but decided patching opp was less invasive for
now.

>
>>
>> Going forward I think (I mentioned that in my other email) that we
>> should rework the OPP API so that callers fetch OPP table object for a
>> device at init/probe time and then use it to get OPPs. This way won't
>> have to travel two lists any time we want to reference an OPP.
>>
>> And instead of relying notifiers, maybe look into using OPP tables
>> directly in cpufreq drivers instead of converting OPP into static-ish
>> cpufreq tables.
>>
>
> If you'd like a proper fix for OPP usage, I am all open to see such a
> proposal that works not just for cpufreq, but also for devfreq as well.

Yeah, let's see what kind of time I have ;)

Thanks.

--
Dmitry

2014-12-24 20:37:35

by Nishanth Menon

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On 12/24/2014 11:44 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <[email protected]> wrote:
>> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
>>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <[email protected]> wrote:
>>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
>>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>>>
>>>>>> While it is true that we can safely do nested RCU locks, This also
>>>>>> encourages wrong usage.
>>>>>>
>>>>>> count = dev_pm_opp_get_opp_count(dev)
>>>>>> ^^ point A
>>>>>> array = kzalloc(count * sizeof (*array));
>>>>>> rcu_read_lock();
>>>>>> ^^ point B
>>>>>> .. work down the list and add OPPs..
>>>>>> ...
>>>>>>
>>>>>> Between A and B, we might have had list modification (dynamic OPP
>>>>>> addition or deletion) - which implies that the count is no longer
>>>>>> accurate between point A and B. instead, enforcing callers to have the
>>>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>>>> library has no clue how to enforce pointer or data accuracy.
>>>>>
>>>>> No, you seem to have a misconception that rcu_lock protects you past
>>>>> the point B, but that is also wrong. The only thing rcu "lock"
>>>>> provides is safe traversing the list and guarantee that elements will
>>>>> not disappear while you are referencing them, but list can both
>>>>> contract and expand under you. In that regard code in
>>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>>>> the list and use number of elements you should be taking a mutex.
>>>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>>>> automatically as we are getting count. We won't get necessarily
>>>>> accurate result, but at least we will be safe traversing the list.
>>>>
>>>> So, instead of a half solution, lets consider this in the realm of
>>>> dynamic OPPs as well. agreed to the point that we only have safe
>>>> traversal and pointer validity. the real problem however is with
>>>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>>>> OPPs in the original version was to escape from it's complexity for
>>>> users - anyways.. we are beyond that now). if OPPs can be removed on
>>>> the fly, we need the following:
>>>> a) use OPP notifiers to adequately handle list modification
>>>> b) lock down list modification (and associated APIs) to ensure that
>>>> the original cpufreq /devfreq list is correct.
>>>>
>>>> I still dont see the need to do this half solution.
>>>
>>> The need for half solution at the moment is that you can't safely
>>> travel the lists and may crash on an invalid pointer.
>>
>> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.
>
> I started there, but it is not only cpufreq-dt that got it wrong. I
> considered changing individual drivers (Viresh also suggested adding
> _locked() variant API), but decided patching opp was less invasive for
> now.

True. I had done an audit and cleanup, I think a couple or so years
back and things ofcourse tend to go down the bitrot path without
constant checkups :(

>>> Going forward I think (I mentioned that in my other email) that we
>>> should rework the OPP API so that callers fetch OPP table object for a
>>> device at init/probe time and then use it to get OPPs. This way won't
>>> have to travel two lists any time we want to reference an OPP.
>>>
>>> And instead of relying notifiers, maybe look into using OPP tables
>>> directly in cpufreq drivers instead of converting OPP into static-ish
>>> cpufreq tables.
>>>
>>
>> If you'd like a proper fix for OPP usage, I am all open to see such a
>> proposal that works not just for cpufreq, but also for devfreq as well.
>
> Yeah, let's see what kind of time I have ;)

That would be nice. Thank you.

if you could post the split off the remaining patches from the series
esp patches #1 and #2 w.r.t 3.19-rc1 and repost, it will be nice to
get them merged in as they do look like obvious improvements good to
get in without depending on the remainder of the series which we can
work on meanwhile.

--
Regards,
Nishanth Menon

2014-12-27 20:12:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

On Wednesday, December 24, 2014 02:37:14 PM Nishanth Menon wrote:
> On 12/24/2014 11:44 AM, Dmitry Torokhov wrote:
> > On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <[email protected]> wrote:
> >> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
> >>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <[email protected]> wrote:
> >>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
> >>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <[email protected]> wrote:
> >>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
> >>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> >>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
> >>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
> >>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
> >>>>>>
> >>>>>> While it is true that we can safely do nested RCU locks, This also
> >>>>>> encourages wrong usage.
> >>>>>>
> >>>>>> count = dev_pm_opp_get_opp_count(dev)
> >>>>>> ^^ point A
> >>>>>> array = kzalloc(count * sizeof (*array));
> >>>>>> rcu_read_lock();
> >>>>>> ^^ point B
> >>>>>> .. work down the list and add OPPs..
> >>>>>> ...
> >>>>>>
> >>>>>> Between A and B, we might have had list modification (dynamic OPP
> >>>>>> addition or deletion) - which implies that the count is no longer
> >>>>>> accurate between point A and B. instead, enforcing callers to have the
> >>>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
> >>>>>> library has no clue how to enforce pointer or data accuracy.
> >>>>>
> >>>>> No, you seem to have a misconception that rcu_lock protects you past
> >>>>> the point B, but that is also wrong. The only thing rcu "lock"
> >>>>> provides is safe traversing the list and guarantee that elements will
> >>>>> not disappear while you are referencing them, but list can both
> >>>>> contract and expand under you. In that regard code in
> >>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
> >>>>> the list and use number of elements you should be taking a mutex.
> >>>>> Luckily all cpufreq drivers at the moment only want to see if OPP
> >>>>> table is empty or not, so as a stop-gap we can take rcu_lock
> >>>>> automatically as we are getting count. We won't get necessarily
> >>>>> accurate result, but at least we will be safe traversing the list.
> >>>>
> >>>> So, instead of a half solution, lets consider this in the realm of
> >>>> dynamic OPPs as well. agreed to the point that we only have safe
> >>>> traversal and pointer validity. the real problem however is with
> >>>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
> >>>> OPPs in the original version was to escape from it's complexity for
> >>>> users - anyways.. we are beyond that now). if OPPs can be removed on
> >>>> the fly, we need the following:
> >>>> a) use OPP notifiers to adequately handle list modification
> >>>> b) lock down list modification (and associated APIs) to ensure that
> >>>> the original cpufreq /devfreq list is correct.
> >>>>
> >>>> I still dont see the need to do this half solution.
> >>>
> >>> The need for half solution at the moment is that you can't safely
> >>> travel the lists and may crash on an invalid pointer.
> >>
> >> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.
> >
> > I started there, but it is not only cpufreq-dt that got it wrong. I
> > considered changing individual drivers (Viresh also suggested adding
> > _locked() variant API), but decided patching opp was less invasive for
> > now.
>
> True. I had done an audit and cleanup, I think a couple or so years
> back and things ofcourse tend to go down the bitrot path without
> constant checkups :(
>
> >>> Going forward I think (I mentioned that in my other email) that we
> >>> should rework the OPP API so that callers fetch OPP table object for a
> >>> device at init/probe time and then use it to get OPPs. This way won't
> >>> have to travel two lists any time we want to reference an OPP.
> >>>
> >>> And instead of relying notifiers, maybe look into using OPP tables
> >>> directly in cpufreq drivers instead of converting OPP into static-ish
> >>> cpufreq tables.
> >>>
> >>
> >> If you'd like a proper fix for OPP usage, I am all open to see such a
> >> proposal that works not just for cpufreq, but also for devfreq as well.
> >
> > Yeah, let's see what kind of time I have ;)
>
> That would be nice. Thank you.
>
> if you could post the split off the remaining patches from the series
> esp patches #1 and #2 w.r.t 3.19-rc1 and repost, it will be nice to
> get them merged in as they do look like obvious improvements good to
> get in without depending on the remainder of the series which we can
> work on meanwhile.

I actually have the entire Dmitry's series queued up for the next push
(most likely on Monday) and at this point I'm not seeing a compelling
reason for not pushing it.


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.