This patch set changes workqueue related features in devfreq framework.
First patch switches to delayed work instead of deferred.
The second switches to regular system work and deletes custom 'devfreq'.
Using deferred work in this context might harm the system performance.
When the CPU enters idle, deferred work is not fired. The devfreq device's
utilization does not have to be connected with a particular CPU.
The drivers for GPUs, Network on Chip, cache L3 rely on devfreq governor.
They all are missing opportunity to check the HW state and react when
the deferred work is not fired.
A corner test case, when Dynamic Memory Controller is utilized by CPUs running
on full speed, might show x5 worse performance if the crucial CPU is in idle.
There was a discussion regarding v2 that the power usage because of waking up
an idle CPU would be too high. In my opinion it won't and fixing bug is more
important than < 1% more power used [1].
I have addressed also this issue. In this patch set there is a mechanism
which prevents from to frequent checks when there is no need.
When the device enters its lowest state (OPP) the framework sets polling
interval to 'polling_idle_ms'. In default Kconfig it is 500ms, so 2 times per
second.
It is tunable from the sysfs interface per device, thus driver developer can
experiment and choose best intervals for the system.
Changes:
v3:
- reordered first two patches
- added functionality to lower power consumption when the device is less busy;
there is a new polling interval enabled when device enters lowest frequency;
- added trace events to capture the behaviour of the system
v2:
- single patch split into two
- added cover letter
link for the previous version and discussion:
https://marc.info/?l=linux-pm&m=154989907416072&w=2
https://marc.info/?l=linux-pm&m=154904631226997&w=2
Regards,
Lukasz Luba
[1] https://marc.info/?l=linux-kernel&m=155000641622443&w=2
Lukasz Luba (7):
drivers: devfreq: change deferred work into delayed
drivers: devfreq: change devfreq workqueue mechanism
Kconfig: drivers: devfreq: add default idle polling
include: devfreq: add polling_idle_ms to 'profile'
drivers: devfreq: add longer polling interval in idle
trace: events: add devfreq trace event file
drivers: devfreq: add tracing for scheduling work
MAINTAINERS | 1 +
drivers/devfreq/Kconfig | 13 +++
drivers/devfreq/devfreq.c | 184 ++++++++++++++++++++++++------
drivers/devfreq/governor.h | 3 +-
drivers/devfreq/governor_simpleondemand.c | 6 +-
include/linux/devfreq.h | 6 +
include/trace/events/devfreq.h | 39 +++++++
7 files changed, 218 insertions(+), 34 deletions(-)
create mode 100644 include/trace/events/devfreq.h
--
2.7.4
There is no need for creating another workqueue in the system,
the existing one should meet the requirements.
This patch removes devfreq's custom workqueue and uses system one.
It switches from queue_delayed_work() to schedule_delayed_work().
It also does not wake up the system when it enters suspend (this
functionality stays the same).
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/devfreq.c | 25 ++++++-------------------
1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 0c9bff8..c200b3c 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -31,13 +31,6 @@
static struct class *devfreq_class;
-/*
- * devfreq core provides delayed work based load monitoring helper
- * functions. Governors can use these or can implement their own
- * monitoring mechanism.
- */
-static struct workqueue_struct *devfreq_wq;
-
/* The list of all device-devfreq governors */
static LIST_HEAD(devfreq_governor_list);
/* The list of all device-devfreq */
@@ -391,8 +384,8 @@ static void devfreq_monitor(struct work_struct *work)
if (err)
dev_err(&devfreq->dev, "dvfs failed with (%d) error\n", err);
- queue_delayed_work(devfreq_wq, &devfreq->work,
- msecs_to_jiffies(devfreq->profile->polling_ms));
+ schedule_delayed_work(&devfreq->work,
+ msecs_to_jiffies(devfreq->profile->polling_ms));
mutex_unlock(&devfreq->lock);
}
@@ -409,7 +402,7 @@ void devfreq_monitor_start(struct devfreq *devfreq)
{
INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
if (devfreq->profile->polling_ms)
- queue_delayed_work(devfreq_wq, &devfreq->work,
+ schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
}
EXPORT_SYMBOL(devfreq_monitor_start);
@@ -473,7 +466,7 @@ void devfreq_monitor_resume(struct devfreq *devfreq)
if (!delayed_work_pending(&devfreq->work) &&
devfreq->profile->polling_ms)
- queue_delayed_work(devfreq_wq, &devfreq->work,
+ schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
devfreq->last_stat_updated = jiffies;
@@ -516,7 +509,7 @@ void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay)
/* if current delay is zero, start polling with new delay */
if (!cur_delay) {
- queue_delayed_work(devfreq_wq, &devfreq->work,
+ schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
goto out;
}
@@ -527,7 +520,7 @@ void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay)
cancel_delayed_work_sync(&devfreq->work);
mutex_lock(&devfreq->lock);
if (!devfreq->stop_polling)
- queue_delayed_work(devfreq_wq, &devfreq->work,
+ schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
}
out:
@@ -1430,12 +1423,6 @@ static int __init devfreq_init(void)
return PTR_ERR(devfreq_class);
}
- devfreq_wq = create_freezable_workqueue("devfreq_wq");
- if (!devfreq_wq) {
- class_destroy(devfreq_class);
- pr_err("%s: couldn't create workqueue\n", __FILE__);
- return -ENOMEM;
- }
devfreq_class->dev_groups = devfreq_groups;
return 0;
--
2.7.4
Add needed fields to support new state: idle, where different polling
interval is in use. It provides better control of the devfreq device
and lower the power consumption when the device is not busy.
Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/devfreq.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index fbffa74..5140970 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -72,6 +72,11 @@ struct devfreq_dev_status {
* @initial_freq: The operating frequency when devfreq_add_device() is
* called.
* @polling_ms: The polling interval in ms. 0 disables polling.
+ * @polling_idle_ms: The polling interval in 'idle state' in ms.
+ * When the device is running at lowest frequency and has
+ * low-load, it is considered as in 'idle state'.
+ * Thus, longer polling interval is used for the device
+ * to save some power.
* @target: The device should set its operating frequency at
* freq or lowest-upper-than-freq value. If freq is
* higher than any operable frequency, set maximum.
@@ -98,6 +103,7 @@ struct devfreq_dev_status {
struct devfreq_dev_profile {
unsigned long initial_freq;
unsigned int polling_ms;
+ unsigned int polling_idle_ms;
int (*target)(struct device *dev, unsigned long *freq, u32 flags);
int (*get_dev_status)(struct device *dev,
--
2.7.4
This patch add basic tracing of the devfreq workqueue and delayed work.
It aims to capture changes of the polling intervals and device state.
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/devfreq.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 29e99ce..c1d0d8c 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -29,6 +29,9 @@
#include <linux/of.h>
#include "governor.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/devfreq.h>
+
/* The ~30% load threshold used for load calculation (due to fixed point
* arithmetic) */
#define LOAD_THRESHOLD_IN_DEVICE_USAGE (300)
@@ -418,6 +421,7 @@ static void devfreq_monitor(struct work_struct *work)
struct devfreq *devfreq = container_of(work,
struct devfreq, work.work);
unsigned int polling_ms;
+ const char *df_name = dev_name(&devfreq->dev);
mutex_lock(&devfreq->lock);
polling_ms = devfreq_get_polling_delay(devfreq);
@@ -429,6 +433,10 @@ static void devfreq_monitor(struct work_struct *work)
schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(polling_ms));
mutex_unlock(&devfreq->lock);
+
+ trace_devfreq_monitor(df_name, devfreq->previous_freq, polling_ms,
+ devfreq->last_status.busy_time,
+ devfreq->last_status.total_time);
}
/**
--
2.7.4
This patch adds default idle polling value for devfreq devices.
It is used in idle state, when the device is running at lowest frequency.
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/Kconfig | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
index 6a172d3..f0fcc5e 100644
--- a/drivers/devfreq/Kconfig
+++ b/drivers/devfreq/Kconfig
@@ -29,6 +29,19 @@ menuconfig PM_DEVFREQ
if PM_DEVFREQ
+config DEVFREQ_DEFAULT_POLLING_IDLE_MS
+ int "Default polling interval in idle state (ms)"
+ default 500
+ help
+ The devfreq device is being polled by the devfreq framework.
+ There are two states for the device operation: normal and idle.
+ Each state has corresponding polling interval.
+ When a device is not heavily loaded and runs at lowest frequency,
+ it is in 'idle' state and longer polling interval is used.
+ This setting controls the idle polling interval.
+ It is designed to lower for power consumption.
+ Set this to non-zero if the default value is to big for your system.
+
comment "DEVFREQ Governors"
config DEVFREQ_GOV_SIMPLE_ONDEMAND
--
2.7.4
This patch changes deferred work to delayed work, which is now not missed
when timer is put on CPU that entered idle state.
The devfreq framework governor was not called, thus changing the device's
frequency did not happen.
Benchmarks for stressing Dynamic Memory Controller show x2 (in edge cases
even x5) performance boost with this patch when 'simpleondemand_governor'
is responsible for monitoring the device load and frequency changes.
With this patch, the delayed work is done no mater CPUs' idle.
All of the drivers in devfreq which rely on periodic, guaranteed wakeup
intervals should benefit from it.
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/devfreq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 0ae3de7..0c9bff8 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -407,7 +407,7 @@ static void devfreq_monitor(struct work_struct *work)
*/
void devfreq_monitor_start(struct devfreq *devfreq)
{
- INIT_DEFERRABLE_WORK(&devfreq->work, devfreq_monitor);
+ INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
if (devfreq->profile->polling_ms)
queue_delayed_work(devfreq_wq, &devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
--
2.7.4
The patch adds a new file for with trace events for devfreq
framework. They are used for performance analysis of the framework.
It also contains updates in MAINTAINERS file adding new entry for
devfreq maintainers.
Signed-off-by: Lukasz Luba <[email protected]>
---
MAINTAINERS | 1 +
include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)
create mode 100644 include/trace/events/devfreq.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 9919840..c042fda 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4447,6 +4447,7 @@ S: Maintained
F: drivers/devfreq/
F: include/linux/devfreq.h
F: Documentation/devicetree/bindings/devfreq/
+F: include/trace/events/devfreq.h
DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
M: Chanwoo Choi <[email protected]>
diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
new file mode 100644
index 0000000..fec9304
--- /dev/null
+++ b/include/trace/events/devfreq.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM devfreq
+
+#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DEVFREQ_H
+
+#include <linux/devfreq.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(devfreq_monitor,
+ TP_PROTO(const char *dev_name, unsigned long freq,
+ unsigned int polling_ms, unsigned long busy_time,
+ unsigned long total_time),
+
+ TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
+
+ TP_STRUCT__entry(
+ __string(dev_name, dev_name)
+ __field(unsigned long, freq)
+ __field(unsigned int, polling_ms)
+ __field(unsigned int, load)
+ ),
+
+ TP_fast_assign(
+ __assign_str(dev_name, dev_name);
+ __entry->freq = freq;
+ __entry->polling_ms = polling_ms;
+ __entry->load = (100 * busy_time) / total_time;
+ ),
+
+ TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
+ __get_str(dev_name), __entry->freq, __entry->polling_ms,
+ __entry->load)
+);
+#endif /* _TRACE_DEVFREQ_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.7.4
This patch adds new mechanism for devfreq devices which changes polling
interval. The system should sleep longer when the devfreq device is almost
not used. The devfreq framework will not schedule the work too often.
This low-load state is recognised when the device is operating at the lowest
frequency and has low busy_time/total_time factor (< 30%).
When the frequency is different then min, the device is under normal polling
which is the value defined in driver's 'polling_ms'.
When the device is getting more pressure, the framework is able to catch it
based on 'load' in lowest frequency and will start polling more frequently.
The second scenario is when the governor recognised heavy load at minimum
frequency and increases the frequency. The devfreq framework will start
polling in shorter intervals.
The polling interval, when the device is not heavily, can also be changed
from userspace of defined by the driver's author.
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/devfreq.c | 151 +++++++++++++++++++++++++++---
drivers/devfreq/governor.h | 3 +-
drivers/devfreq/governor_simpleondemand.c | 6 +-
3 files changed, 145 insertions(+), 15 deletions(-)
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index c200b3c..29e99ce 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -29,6 +29,13 @@
#include <linux/of.h>
#include "governor.h"
+/* The ~30% load threshold used for load calculation (due to fixed point
+ * arithmetic) */
+#define LOAD_THRESHOLD_IN_DEVICE_USAGE (300)
+
+static const
+unsigned int default_polling_idle_ms = CONFIG_DEVFREQ_DEFAULT_POLLING_IDLE_MS;
+
static struct class *devfreq_class;
/* The list of all device-devfreq governors */
@@ -144,6 +151,38 @@ static int set_freq_table(struct devfreq *devfreq)
}
/**
+ * devfreq_get_polling_delay() - gets the polling delay for current state
+ * @devfreq: the devfreq instance.
+ *
+ * Helper function which checks existing device state and returns polling
+ * interval. The function requires the caller holds devfreq->lock.
+ */
+static int devfreq_get_polling_delay(struct devfreq *devfreq)
+{
+ unsigned int scaling_min_freq;
+ unsigned long load;
+
+ lockdep_assert_held(&devfreq->lock);
+
+ /* Check the frequency of the device. If it not the lowest then use
+ * device's polling_ms interval and job is done. */
+ scaling_min_freq = max(devfreq->scaling_min_freq, devfreq->min_freq);
+ if (scaling_min_freq != devfreq->previous_freq)
+ return devfreq->profile->polling_ms;
+
+ /* The device is running minimum frequency, check the load and if
+ * the value crosses the threshold, start polling with device's
+ * polling_ms value. */
+ load = devfreq->last_status.busy_time << 10;
+ load /= devfreq->last_status.total_time;
+
+ if (load > LOAD_THRESHOLD_IN_DEVICE_USAGE)
+ return devfreq->profile->polling_ms;
+ else
+ return devfreq->profile->polling_idle_ms;
+}
+
+/**
* devfreq_update_status() - Update statistics of devfreq behavior
* @devfreq: the devfreq instance
* @freq: the update target frequency
@@ -378,14 +417,17 @@ static void devfreq_monitor(struct work_struct *work)
int err;
struct devfreq *devfreq = container_of(work,
struct devfreq, work.work);
+ unsigned int polling_ms;
mutex_lock(&devfreq->lock);
+ polling_ms = devfreq_get_polling_delay(devfreq);
+
err = update_devfreq(devfreq);
if (err)
dev_err(&devfreq->dev, "dvfs failed with (%d) error\n", err);
schedule_delayed_work(&devfreq->work,
- msecs_to_jiffies(devfreq->profile->polling_ms));
+ msecs_to_jiffies(polling_ms));
mutex_unlock(&devfreq->lock);
}
@@ -401,6 +443,7 @@ static void devfreq_monitor(struct work_struct *work)
void devfreq_monitor_start(struct devfreq *devfreq)
{
INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
+ /* Start polling with normal (not idle) polling interval. */
if (devfreq->profile->polling_ms)
schedule_delayed_work(&devfreq->work,
msecs_to_jiffies(devfreq->profile->polling_ms));
@@ -464,6 +507,7 @@ void devfreq_monitor_resume(struct devfreq *devfreq)
if (!devfreq->stop_polling)
goto out;
+ /* In resume, normal (not idle) polling interval is used. */
if (!delayed_work_pending(&devfreq->work) &&
devfreq->profile->polling_ms)
schedule_delayed_work(&devfreq->work,
@@ -485,43 +529,60 @@ EXPORT_SYMBOL(devfreq_monitor_resume);
* devfreq_interval_update() - Update device devfreq monitoring interval
* @devfreq: the devfreq instance.
* @delay: new polling interval to be set.
+ * @idle: indicates state for which the new interval is going to be set.
*
* Helper function to set new load monitoring polling interval. Function
- * to be called from governor in response to DEVFREQ_GOV_INTERVAL event.
+ * to be called from governor in response to DEVFREQ_GOV_INTERVAL or
+ * DEVFREQ_GOV_IDLE_INTERVAL event.
*/
-void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay)
+void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay,
+ bool idle)
{
- unsigned int cur_delay = devfreq->profile->polling_ms;
+ unsigned int cur_delay;
unsigned int new_delay = *delay;
+ bool dev_in_idle = false;
mutex_lock(&devfreq->lock);
- devfreq->profile->polling_ms = new_delay;
+ cur_delay = devfreq_get_polling_delay(devfreq);
+ /* check if we are currently in idle state, it will be needed later */
+ if (cur_delay == devfreq->profile->polling_idle_ms)
+ dev_in_idle = true;
+
+ if (idle)
+ devfreq->profile->polling_idle_ms = new_delay;
+ else
+ devfreq->profile->polling_ms = new_delay;
+
+ /* device is in suspend, does not need to do anything more. */
if (devfreq->stop_polling)
goto out;
- /* if new delay is zero, stop polling */
- if (!new_delay) {
+ /* if new delay is zero and it is for 'normal' polling,
+ * then stop polling */
+ if (!new_delay && !idle) {
mutex_unlock(&devfreq->lock);
cancel_delayed_work_sync(&devfreq->work);
return;
}
- /* if current delay is zero, start polling with new delay */
- if (!cur_delay) {
+ /* if current delay is zero and it is not for idle,
+ * start polling with 'normal' polling interval */
+ if (!cur_delay && !idle) {
schedule_delayed_work(&devfreq->work,
- msecs_to_jiffies(devfreq->profile->polling_ms));
+ msecs_to_jiffies(new_delay));
goto out;
}
- /* if current delay is greater than new delay, restart polling */
- if (cur_delay > new_delay) {
+ /* if current delay is greater than new delay and the new polling value
+ * corresponds to the current state, restart polling */
+ if (cur_delay > new_delay && dev_in_idle == idle) {
mutex_unlock(&devfreq->lock);
cancel_delayed_work_sync(&devfreq->work);
mutex_lock(&devfreq->lock);
if (!devfreq->stop_polling)
schedule_delayed_work(&devfreq->work,
- msecs_to_jiffies(devfreq->profile->polling_ms));
+ msecs_to_jiffies(new_delay));
}
out:
mutex_unlock(&devfreq->lock);
@@ -590,6 +651,24 @@ static void devfreq_dev_release(struct device *dev)
}
/**
+ * polling_idle_init() - Initialize polling interval for device's low-load.
+ * @df: the devfreq device which is setup
+ *
+ * The function checks if the driver's code defined the 'polling_idle_ms' and
+ * leaves it or tries to initialise according to the framework's default value
+ * and 'normal' polling interval ('polling_ms').
+ */
+static void polling_idle_init(struct devfreq *df)
+{
+ if (!df->profile->polling_idle_ms)
+ df->profile->polling_idle_ms = default_polling_idle_ms;
+
+ if (df->profile->polling_idle_ms <= df->profile->polling_ms)
+ df->profile->polling_idle_ms = df->profile->polling_ms +
+ default_polling_idle_ms;
+}
+
+/**
* devfreq_add_device() - Add devfreq feature to the device
* @dev: the device to add devfreq feature.
* @profile: device-specific profile to run devfreq.
@@ -664,6 +743,8 @@ struct devfreq *devfreq_add_device(struct device *dev,
}
devfreq->max_freq = devfreq->scaling_max_freq;
+ polling_idle_init(devfreq);
+
devfreq->suspend_freq = dev_pm_opp_get_suspend_opp_freq(dev);
atomic_set(&devfreq->suspend_count, 0);
@@ -1235,6 +1316,13 @@ static ssize_t polling_interval_store(struct device *dev,
if (ret != 1)
return -EINVAL;
+ mutex_lock(&df->lock);
+ if (df->profile->polling_idle_ms < value) {
+ mutex_unlock(&df->lock);
+ return -EINVAL;
+ }
+ mutex_unlock(&df->lock);
+
df->governor->event_handler(df, DEVFREQ_GOV_INTERVAL, &value);
ret = count;
@@ -1242,6 +1330,42 @@ static ssize_t polling_interval_store(struct device *dev,
}
static DEVICE_ATTR_RW(polling_interval);
+static ssize_t
+polling_idle_interval_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", to_devfreq(dev)->profile->polling_idle_ms);
+}
+
+static ssize_t polling_idle_interval_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct devfreq *df = to_devfreq(dev);
+ unsigned int value;
+ int ret;
+
+ if (!df->governor)
+ return -EINVAL;
+
+ ret = sscanf(buf, "%u", &value);
+ if (ret != 1)
+ return -EINVAL;
+
+ mutex_lock(&df->lock);
+ if (df->profile->polling_ms > value) {
+ mutex_unlock(&df->lock);
+ return -EINVAL;
+ }
+ mutex_unlock(&df->lock);
+
+ df->governor->event_handler(df, DEVFREQ_GOV_IDLE_INTERVAL, &value);
+ ret = count;
+
+ return ret;
+}
+static DEVICE_ATTR_RW(polling_idle_interval);
+
static ssize_t min_freq_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
@@ -1408,6 +1532,7 @@ static struct attribute *devfreq_attrs[] = {
&dev_attr_available_frequencies.attr,
&dev_attr_target_freq.attr,
&dev_attr_polling_interval.attr,
+ &dev_attr_polling_idle_interval.attr,
&dev_attr_min_freq.attr,
&dev_attr_max_freq.attr,
&dev_attr_trans_stat.attr,
diff --git a/drivers/devfreq/governor.h b/drivers/devfreq/governor.h
index f53339c..fa4f0c5 100644
--- a/drivers/devfreq/governor.h
+++ b/drivers/devfreq/governor.h
@@ -24,6 +24,7 @@
#define DEVFREQ_GOV_INTERVAL 0x3
#define DEVFREQ_GOV_SUSPEND 0x4
#define DEVFREQ_GOV_RESUME 0x5
+#define DEVFREQ_GOV_IDLE_INTERVAL 0x6
#define DEVFREQ_MIN_FREQ 0
#define DEVFREQ_MAX_FREQ ULONG_MAX
@@ -62,7 +63,7 @@ extern void devfreq_monitor_stop(struct devfreq *devfreq);
extern void devfreq_monitor_suspend(struct devfreq *devfreq);
extern void devfreq_monitor_resume(struct devfreq *devfreq);
extern void devfreq_interval_update(struct devfreq *devfreq,
- unsigned int *delay);
+ unsigned int *delay, bool idle);
extern int devfreq_add_governor(struct devfreq_governor *governor);
extern int devfreq_remove_governor(struct devfreq_governor *governor);
diff --git a/drivers/devfreq/governor_simpleondemand.c b/drivers/devfreq/governor_simpleondemand.c
index c0417f0..cc1feac 100644
--- a/drivers/devfreq/governor_simpleondemand.c
+++ b/drivers/devfreq/governor_simpleondemand.c
@@ -100,7 +100,11 @@ static int devfreq_simple_ondemand_handler(struct devfreq *devfreq,
break;
case DEVFREQ_GOV_INTERVAL:
- devfreq_interval_update(devfreq, (unsigned int *)data);
+ devfreq_interval_update(devfreq, (unsigned int *)data, false);
+ break;
+
+ case DEVFREQ_GOV_IDLE_INTERVAL:
+ devfreq_interval_update(devfreq, (unsigned int *)data, true);
break;
case DEVFREQ_GOV_SUSPEND:
--
2.7.4
On Tue, 12 Feb 2019 23:23:57 +0100
Lukasz Luba <[email protected]> wrote:
> The patch adds a new file for with trace events for devfreq
> framework. They are used for performance analysis of the framework.
> It also contains updates in MAINTAINERS file adding new entry for
> devfreq maintainers.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> MAINTAINERS | 1 +
> include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+)
> create mode 100644 include/trace/events/devfreq.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9919840..c042fda 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4447,6 +4447,7 @@ S: Maintained
> F: drivers/devfreq/
> F: include/linux/devfreq.h
> F: Documentation/devicetree/bindings/devfreq/
> +F: include/trace/events/devfreq.h
>
> DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
> M: Chanwoo Choi <[email protected]>
> diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
> new file mode 100644
> index 0000000..fec9304
> --- /dev/null
> +++ b/include/trace/events/devfreq.h
> @@ -0,0 +1,39 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM devfreq
> +
> +#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_DEVFREQ_H
> +
> +#include <linux/devfreq.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(devfreq_monitor,
> + TP_PROTO(const char *dev_name, unsigned long freq,
> + unsigned int polling_ms, unsigned long busy_time,
> + unsigned long total_time),
> +
> + TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
> +
> + TP_STRUCT__entry(
> + __string(dev_name, dev_name)
> + __field(unsigned long, freq)
> + __field(unsigned int, polling_ms)
> + __field(unsigned int, load)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(dev_name, dev_name);
> + __entry->freq = freq;
> + __entry->polling_ms = polling_ms;
> + __entry->load = (100 * busy_time) / total_time;
How critical is the code that this trace event is called in. If it is
not that critical (it is a slow path), then this is fine, but if this
is not a slow path (something triggered 100 HZ or less), then I would
recommend moving the above calculation to TP_printk(). A divide is a
slow operation, and is best to do that in the post processing. The
current location does the divide at the time of the tracepoint is
called.
I would also have a check to make sure that total_time is not zero
here, otherwise that could be bad.
__entry->busy_time = busy_time;
__entry->total_time = total_time;
> + ),
> +
> + TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
> + __get_str(dev_name), __entry->freq, __entry->polling_ms,
__entry->total_time == 0 ? 100 :
__entry->busy_time / __entry->total_time)
-- Steve
> + __entry->load)
> +);
> +#endif /* _TRACE_DEVFREQ_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
> This patch set changes workqueue related features in devfreq framework.
> First patch switches to delayed work instead of deferred.
> The second switches to regular system work and deletes custom 'devfreq'.
>
> Using deferred work in this context might harm the system performance.
> When the CPU enters idle, deferred work is not fired. The devfreq device's
> utilization does not have to be connected with a particular CPU.
> The drivers for GPUs, Network on Chip, cache L3 rely on devfreq governor.
> They all are missing opportunity to check the HW state and react when
> the deferred work is not fired.
> A corner test case, when Dynamic Memory Controller is utilized by CPUs running
> on full speed, might show x5 worse performance if the crucial CPU is in idle.
>
> There was a discussion regarding v2 that the power usage because of waking up
We have not yet finished the any discussion. We only just had a few reply
and you didn't wait for any reply from other. As I already commented
on exynos5422-dmc series, please don't send the next patch
without any final agreement.
In this series, patch1/patch2 are same at version 2 patchset.
Even if we need to discuss more them, you just send same patches
without any agreement among reviewers. At least, you have to wait the reply
instead of just sending the new patchset. It is basic review rule on mailing list.
> an idle CPU would be too high. In my opinion it won't and fixing bug is more
> important than < 1% more power used [1].
> I have addressed also this issue. In this patch set there is a mechanism
> which prevents from to frequent checks when there is no need.
> When the device enters its lowest state (OPP) the framework sets polling
> interval to 'polling_idle_ms'. In default Kconfig it is 500ms, so 2 times per
> second.
> It is tunable from the sysfs interface per device, thus driver developer can
> experiment and choose best intervals for the system.
>
> Changes:
> v3:
> - reordered first two patches
> - added functionality to lower power consumption when the device is less busy;
> there is a new polling interval enabled when device enters lowest frequency;
> - added trace events to capture the behaviour of the system
> v2:
> - single patch split into two
> - added cover letter
>
> link for the previous version and discussion:
> https://marc.info/?l=linux-pm&m=154989907416072&w=2
> https://marc.info/?l=linux-pm&m=154904631226997&w=2
>
> Regards,
> Lukasz Luba
>
> [1] https://marc.info/?l=linux-kernel&m=155000641622443&w=2
>
> Lukasz Luba (7):
> drivers: devfreq: change deferred work into delayed
> drivers: devfreq: change devfreq workqueue mechanism
> Kconfig: drivers: devfreq: add default idle polling
> include: devfreq: add polling_idle_ms to 'profile'
> drivers: devfreq: add longer polling interval in idle
> trace: events: add devfreq trace event file
> drivers: devfreq: add tracing for scheduling work
>
> MAINTAINERS | 1 +
> drivers/devfreq/Kconfig | 13 +++
> drivers/devfreq/devfreq.c | 184 ++++++++++++++++++++++++------
> drivers/devfreq/governor.h | 3 +-
> drivers/devfreq/governor_simpleondemand.c | 6 +-
> include/linux/devfreq.h | 6 +
> include/trace/events/devfreq.h | 39 +++++++
> 7 files changed, 218 insertions(+), 34 deletions(-)
> create mode 100644 include/trace/events/devfreq.h
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Chanwoo,
On 2/13/19 1:18 AM, Chanwoo Choi wrote:
> On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
>> This patch set changes workqueue related features in devfreq framework.
>> First patch switches to delayed work instead of deferred.
>> The second switches to regular system work and deletes custom 'devfreq'.
>>
>> Using deferred work in this context might harm the system performance.
>> When the CPU enters idle, deferred work is not fired. The devfreq device's
>> utilization does not have to be connected with a particular CPU.
>> The drivers for GPUs, Network on Chip, cache L3 rely on devfreq governor.
>> They all are missing opportunity to check the HW state and react when
>> the deferred work is not fired.
>> A corner test case, when Dynamic Memory Controller is utilized by CPUs running
>> on full speed, might show x5 worse performance if the crucial CPU is in idle.
>>
>> There was a discussion regarding v2 that the power usage because of waking up
>
> We have not yet finished the any discussion. We only just had a few reply
> and you didn't wait for any reply from other. As I already commented
> on exynos5422-dmc series, please don't send the next patch
> without any final agreement.
>
> In this series, patch1/patch2 are same at version 2 patchset.
> Even if we need to discuss more them, you just send same patches
> without any agreement among reviewers. At least, you have to wait the reply
> instead of just sending the new patchset. It is basic review rule on mailing list.
I think you are blocking the fixes.
Matthias is currently fixing devfreq governors which lack 'case SUSPEND'
and he is interested in these v3. I have fixed a month ago issue with
system suspend and OPP state of devfreq devices, which Tobias was rising
and sending patches a few years ago, but you have blocked them.
These version is ready for comments regarding 'battery powered devices'.
The v3 has introduced approach for 2 polling intervals similar to
thermal framework.
It also has trace events. The trace events are *really* important in
this discussion because we need some measurements.
You said:
'When release the real mobile product like galaxy phone,
it is very important issue to remove the unneeded wakeup on idle state.'
Was it a real reason for profiling the devfreq framework and core
workqueue mechanism? There are embedded devices which are not using
battery and probably developers were not aware because there was no
traces which could give any information about devfreq behavior.
Power is important, performance is important but relaying on randomness
is not the best solution. As I said earlier, your driver can stuck on
highest OPP waiting for next callback which will not come...
Regards,
Lukasz
>
>
>> an idle CPU would be too high. In my opinion it won't and fixing bug is more
>> important than < 1% more power used [1].
>> I have addressed also this issue. In this patch set there is a mechanism
>> which prevents from to frequent checks when there is no need.
>> When the device enters its lowest state (OPP) the framework sets polling
>> interval to 'polling_idle_ms'. In default Kconfig it is 500ms, so 2 times per
>> second.
>> It is tunable from the sysfs interface per device, thus driver developer can
>> experiment and choose best intervals for the system.
>>
>> Changes:
>> v3:
>> - reordered first two patches
>> - added functionality to lower power consumption when the device is less busy;
>> there is a new polling interval enabled when device enters lowest frequency;
>> - added trace events to capture the behaviour of the system
>> v2:
>> - single patch split into two
>> - added cover letter
>>
>> link for the previous version and discussion:
>> https://marc.info/?l=linux-pm&m=154989907416072&w=2
>> https://marc.info/?l=linux-pm&m=154904631226997&w=2
>>
>> Regards,
>> Lukasz Luba
>>
>> [1] https://marc.info/?l=linux-kernel&m=155000641622443&w=2
>>
>> Lukasz Luba (7):
>> drivers: devfreq: change deferred work into delayed
>> drivers: devfreq: change devfreq workqueue mechanism
>> Kconfig: drivers: devfreq: add default idle polling
>> include: devfreq: add polling_idle_ms to 'profile'
>> drivers: devfreq: add longer polling interval in idle
>> trace: events: add devfreq trace event file
>> drivers: devfreq: add tracing for scheduling work
>>
>> MAINTAINERS | 1 +
>> drivers/devfreq/Kconfig | 13 +++
>> drivers/devfreq/devfreq.c | 184 ++++++++++++++++++++++++------
>> drivers/devfreq/governor.h | 3 +-
>> drivers/devfreq/governor_simpleondemand.c | 6 +-
>> include/linux/devfreq.h | 6 +
>> include/trace/events/devfreq.h | 39 +++++++
>> 7 files changed, 218 insertions(+), 34 deletions(-)
>> create mode 100644 include/trace/events/devfreq.h
>>
>
>
Hi Steven,
On 2/13/19 12:14 AM, Steven Rostedt wrote:
> On Tue, 12 Feb 2019 23:23:57 +0100
> Lukasz Luba <[email protected]> wrote:
>
>> The patch adds a new file for with trace events for devfreq
>> framework. They are used for performance analysis of the framework.
>> It also contains updates in MAINTAINERS file adding new entry for
>> devfreq maintainers.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> MAINTAINERS | 1 +
>> include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 40 insertions(+)
>> create mode 100644 include/trace/events/devfreq.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 9919840..c042fda 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -4447,6 +4447,7 @@ S: Maintained
>> F: drivers/devfreq/
>> F: include/linux/devfreq.h
>> F: Documentation/devicetree/bindings/devfreq/
>> +F: include/trace/events/devfreq.h
>>
>> DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
>> M: Chanwoo Choi <[email protected]>
>> diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
>> new file mode 100644
>> index 0000000..fec9304
>> --- /dev/null
>> +++ b/include/trace/events/devfreq.h
>> @@ -0,0 +1,39 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM devfreq
>> +
>> +#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_DEVFREQ_H
>> +
>> +#include <linux/devfreq.h>
>> +#include <linux/tracepoint.h>
>> +
>> +TRACE_EVENT(devfreq_monitor,
>> + TP_PROTO(const char *dev_name, unsigned long freq,
>> + unsigned int polling_ms, unsigned long busy_time,
>> + unsigned long total_time),
>> +
>> + TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
>> +
>> + TP_STRUCT__entry(
>> + __string(dev_name, dev_name)
>> + __field(unsigned long, freq)
>> + __field(unsigned int, polling_ms)
>> + __field(unsigned int, load)
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(dev_name, dev_name);
>> + __entry->freq = freq;
>> + __entry->polling_ms = polling_ms;
>> + __entry->load = (100 * busy_time) / total_time;
>
> How critical is the code that this trace event is called in. If it is
> not that critical (it is a slow path), then this is fine, but if this
> is not a slow path (something triggered 100 HZ or less), then I would
> recommend moving the above calculation to TP_printk(). A divide is a
> slow operation, and is best to do that in the post processing. The
> current location does the divide at the time of the tracepoint is
> called.
I wasn't aware of these two stages, good to know.
I will move it to TP_printk().
>
> I would also have a check to make sure that total_time is not zero
> here, otherwise that could be bad.
>
> __entry->busy_time = busy_time;
> __entry->total_time = total_time;
>
>> + ),
>> +
>> + TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
>
>
>> + __get_str(dev_name), __entry->freq, __entry->polling_ms,
>
> __entry->total_time == 0 ? 100 :
> __entry->busy_time / __entry->total_time)
Thank you for the review.
I will add this check.
Regards,
Lukasz
>
> -- Steve
>
>> + __entry->load)
>> +);
>> +#endif /* _TRACE_DEVFREQ_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>
>
>
On Tue, 12 Feb 2019 23:23:57 +0100
Lukasz Luba <[email protected]> wrote:
> The patch adds a new file for with trace events for devfreq
> framework. They are used for performance analysis of the framework.
> It also contains updates in MAINTAINERS file adding new entry for
> devfreq maintainers.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> MAINTAINERS | 1 +
> include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+)
> create mode 100644 include/trace/events/devfreq.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9919840..c042fda 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4447,6 +4447,7 @@ S: Maintained
> F: drivers/devfreq/
> F: include/linux/devfreq.h
> F: Documentation/devicetree/bindings/devfreq/
> +F: include/trace/events/devfreq.h
>
> DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
> M: Chanwoo Choi <[email protected]>
> diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
> new file mode 100644
> index 0000000..fec9304
> --- /dev/null
> +++ b/include/trace/events/devfreq.h
> @@ -0,0 +1,39 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM devfreq
> +
> +#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_DEVFREQ_H
> +
> +#include <linux/devfreq.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(devfreq_monitor,
> + TP_PROTO(const char *dev_name, unsigned long freq,
> + unsigned int polling_ms, unsigned long busy_time,
> + unsigned long total_time),
> +
> + TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
> +
> + TP_STRUCT__entry(
> + __string(dev_name, dev_name)
I would also put the string at the end. Strings are stored as a dynamic
array in the event, where the position of __string() holds a 32 bit
number. The 16 MSBs is the length of the string and the 16 LSBs is the
offset into the trace event (after the TP_STRUCT__entry).
That means on a 64 bit machine, there's a chance that this struct will
have a 4 byte "hole" between the __string() meta data and the freq
field.
-- Steve
> + __field(unsigned long, freq)
> + __field(unsigned int, polling_ms)
> + __field(unsigned int, load)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(dev_name, dev_name);
> + __entry->freq = freq;
> + __entry->polling_ms = polling_ms;
> + __entry->load = (100 * busy_time) / total_time;
> + ),
> +
> + TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
> + __get_str(dev_name), __entry->freq, __entry->polling_ms,
> + __entry->load)
> +);
> +#endif /* _TRACE_DEVFREQ_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
Hi Steven,
On 2/13/19 2:56 PM, Steven Rostedt wrote:
> On Tue, 12 Feb 2019 23:23:57 +0100
> Lukasz Luba <[email protected]> wrote:
>
>> The patch adds a new file for with trace events for devfreq
>> framework. They are used for performance analysis of the framework.
>> It also contains updates in MAINTAINERS file adding new entry for
>> devfreq maintainers.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> MAINTAINERS | 1 +
>> include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 40 insertions(+)
>> create mode 100644 include/trace/events/devfreq.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 9919840..c042fda 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -4447,6 +4447,7 @@ S: Maintained
>> F: drivers/devfreq/
>> F: include/linux/devfreq.h
>> F: Documentation/devicetree/bindings/devfreq/
>> +F: include/trace/events/devfreq.h
>>
>> DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
>> M: Chanwoo Choi <[email protected]>
>> diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
>> new file mode 100644
>> index 0000000..fec9304
>> --- /dev/null
>> +++ b/include/trace/events/devfreq.h
>> @@ -0,0 +1,39 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM devfreq
>> +
>> +#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_DEVFREQ_H
>> +
>> +#include <linux/devfreq.h>
>> +#include <linux/tracepoint.h>
>> +
>> +TRACE_EVENT(devfreq_monitor,
>> + TP_PROTO(const char *dev_name, unsigned long freq,
>> + unsigned int polling_ms, unsigned long busy_time,
>> + unsigned long total_time),
>> +
>> + TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
>> +
>> + TP_STRUCT__entry(
>> + __string(dev_name, dev_name)
>
> I would also put the string at the end. Strings are stored as a dynamic
> array in the event, where the position of __string() holds a 32 bit
> number. The 16 MSBs is the length of the string and the 16 LSBs is the
> offset into the trace event (after the TP_STRUCT__entry).
That's a nice engineering. Thank you for sharing this knowledge.
>
> That means on a 64 bit machine, there's a chance that this struct will
> have a 4 byte "hole" between the __string() meta data and the freq
> field.
I will put the string at the end.
Regards,
Lukasz
>
> -- Steve
>
>> + __field(unsigned long, freq)
>> + __field(unsigned int, polling_ms)
>> + __field(unsigned int, load)
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(dev_name, dev_name);
>> + __entry->freq = freq;
>> + __entry->polling_ms = polling_ms;
>> + __entry->load = (100 * busy_time) / total_time;
>> + ),
>> +
>> + TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
>> + __get_str(dev_name), __entry->freq, __entry->polling_ms,
>> + __entry->load)
>> +);
>> +#endif /* _TRACE_DEVFREQ_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>
>
>
Hi Matthias, Chanwoo,
Regarding some measurements data for our approximations.
I have found a comprehensive power consumption measurements
for i.MX6 dual/quad (Cortex-A9) boards.
https://cache.freescale.com/files/32bit/doc/app_note/AN4509.pdf
The boards have DDR3 memories not LPDDR3 PoP but it is also measured
and mentioned explicitly.
There are many use case scenarios covered, which use different devices
in the SoC.
Regards,
Lukasz
On 2/13/19 12:14 PM, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 2/13/19 1:18 AM, Chanwoo Choi wrote:
>> On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
>>> This patch set changes workqueue related features in devfreq framework.
>>> First patch switches to delayed work instead of deferred.
>>> The second switches to regular system work and deletes custom 'devfreq'.
>>>
>>> Using deferred work in this context might harm the system performance.
>>> When the CPU enters idle, deferred work is not fired. The devfreq
>>> device's
>>> utilization does not have to be connected with a particular CPU.
>>> The drivers for GPUs, Network on Chip, cache L3 rely on devfreq
>>> governor.
>>> They all are missing opportunity to check the HW state and react when
>>> the deferred work is not fired.
>>> A corner test case, when Dynamic Memory Controller is utilized by
>>> CPUs running
>>> on full speed, might show x5 worse performance if the crucial CPU is
>>> in idle.
>>>
>>> There was a discussion regarding v2 that the power usage because of
>>> waking up
>>
>> We have not yet finished the any discussion. We only just had a few reply
>> and you didn't wait for any reply from other. As I already commented
>> on exynos5422-dmc series, please don't send the next patch
>> without any final agreement.
>>
>> In this series, patch1/patch2 are same at version 2 patchset.
>> Even if we need to discuss more them, you just send same patches
>> without any agreement among reviewers. At least, you have to wait the
>> reply
>> instead of just sending the new patchset. It is basic review rule on
>> mailing list.
> I think you are blocking the fixes.
> Matthias is currently fixing devfreq governors which lack 'case SUSPEND'
> and he is interested in these v3. I have fixed a month ago issue with
> system suspend and OPP state of devfreq devices, which Tobias was rising
> and sending patches a few years ago, but you have blocked them.
>
> These version is ready for comments regarding 'battery powered devices'.
> The v3 has introduced approach for 2 polling intervals similar to
> thermal framework.
> It also has trace events. The trace events are *really* important in
> this discussion because we need some measurements.
>
> You said:
> 'When release the real mobile product like galaxy phone,
> it is very important issue to remove the unneeded wakeup on idle state.'
> Was it a real reason for profiling the devfreq framework and core
> workqueue mechanism? There are embedded devices which are not using
> battery and probably developers were not aware because there was no
> traces which could give any information about devfreq behavior.
> Power is important, performance is important but relaying on randomness
> is not the best solution. As I said earlier, your driver can stuck on
> highest OPP waiting for next callback which will not come...
>
> Regards,
> Lukasz
>>
>>
>>> an idle CPU would be too high. In my opinion it won't and fixing bug
>>> is more
>>> important than < 1% more power used [1].
>>> I have addressed also this issue. In this patch set there is a mechanism
>>> which prevents from to frequent checks when there is no need.
>>> When the device enters its lowest state (OPP) the framework sets polling
>>> interval to 'polling_idle_ms'. In default Kconfig it is 500ms, so 2
>>> times per
>>> second.
>>> It is tunable from the sysfs interface per device, thus driver
>>> developer can
>>> experiment and choose best intervals for the system.
>>>
>>> Changes:
>>> v3:
>>> - reordered first two patches
>>> - added functionality to lower power consumption when the device is
>>> less busy;
>>> there is a new polling interval enabled when device enters lowest
>>> frequency;
>>> - added trace events to capture the behaviour of the system
>>> v2:
>>> - single patch split into two
>>> - added cover letter
>>>
>>> link for the previous version and discussion:
>>> https://marc.info/?l=linux-pm&m=154989907416072&w=2
>>> https://marc.info/?l=linux-pm&m=154904631226997&w=2
>>>
>>> Regards,
>>> Lukasz Luba
>>>
>>> [1] https://marc.info/?l=linux-kernel&m=155000641622443&w=2
>>>
>>> Lukasz Luba (7):
>>> drivers: devfreq: change deferred work into delayed
>>> drivers: devfreq: change devfreq workqueue mechanism
>>> Kconfig: drivers: devfreq: add default idle polling
>>> include: devfreq: add polling_idle_ms to 'profile'
>>> drivers: devfreq: add longer polling interval in idle
>>> trace: events: add devfreq trace event file
>>> drivers: devfreq: add tracing for scheduling work
>>>
>>> MAINTAINERS | 1 +
>>> drivers/devfreq/Kconfig | 13 +++
>>> drivers/devfreq/devfreq.c | 184
>>> ++++++++++++++++++++++++------
>>> drivers/devfreq/governor.h | 3 +-
>>> drivers/devfreq/governor_simpleondemand.c | 6 +-
>>> include/linux/devfreq.h | 6 +
>>> include/trace/events/devfreq.h | 39 +++++++
>>> 7 files changed, 218 insertions(+), 34 deletions(-)
>>> create mode 100644 include/trace/events/devfreq.h
>>>
>>
>>
Hi Lukasz,
On 19. 2. 13. 오후 8:14, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 2/13/19 1:18 AM, Chanwoo Choi wrote:
>> On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
>>> This patch set changes workqueue related features in devfreq framework.
>>> First patch switches to delayed work instead of deferred.
>>> The second switches to regular system work and deletes custom 'devfreq'.
>>>
>>> Using deferred work in this context might harm the system performance.
>>> When the CPU enters idle, deferred work is not fired. The devfreq device's
>>> utilization does not have to be connected with a particular CPU.
>>> The drivers for GPUs, Network on Chip, cache L3 rely on devfreq governor.
>>> They all are missing opportunity to check the HW state and react when
>>> the deferred work is not fired.
>>> A corner test case, when Dynamic Memory Controller is utilized by CPUs running
>>> on full speed, might show x5 worse performance if the crucial CPU is in idle.
>>>
>>> There was a discussion regarding v2 that the power usage because of waking up
>>
>> We have not yet finished the any discussion. We only just had a few reply
>> and you didn't wait for any reply from other. As I already commented
>> on exynos5422-dmc series, please don't send the next patch
>> without any final agreement.
>>
>> In this series, patch1/patch2 are same at version 2 patchset.
>> Even if we need to discuss more them, you just send same patches
>> without any agreement among reviewers. At least, you have to wait the reply
>> instead of just sending the new patchset. It is basic review rule on mailing list.
> I think you are blocking the fixes.
> Matthias is currently fixing devfreq governors which lack 'case SUSPEND'
> and he is interested in these v3. I have fixed a month ago issue with
> system suspend and OPP state of devfreq devices, which Tobias was rising
> and sending patches a few years ago, but you have blocked them.
I never blocked any something. I spent many times to review the devfreq patches
to imporve the devfreq frameworw. In case of Tobias's patch,
I just suggested other opinion and approach, but he didn't finish them.
If I blocked some patches for devfreq, why am I review your suspend patch
and agree them? It is very ridiculous of your thinking.
>
> These version is ready for comments regarding 'battery powered devices'.
> The v3 has introduced approach for 2 polling intervals similar to
> thermal framework.
> It also has trace events. The trace events are *really* important in
> this discussion because we need some measurements.
>
> You said:
> 'When release the real mobile product like galaxy phone,
> it is very important issue to remove the unneeded wakeup on idle state.'
> Was it a real reason for profiling the devfreq framework and core
> workqueue mechanism? There are embedded devices which are not using
Yes. I think that deferrable workqueue is very important for some case.
But, on the other hand, it might be always true.
I don't just want to remove the deferrable work. But, I agree that
to support both delayed and deferrable work according to the
choice of devfreq device drivers.
If remove the deferrable work, it always wakeup the idle state.
The device driver never handle them. Someone want to get the
high-performance but, someone don't want to wakeup from idle state.
On your v2 patchset, just remove the deferrable work without
any tunable feature. So, I try to discuss them. But, you just
sent next patches without any discussion. And then, you also
asked that we had to check the next patches.
It is wrong for review sequence. I have to wait the review
/discuss before sending the next patches.
Once again,
I don't just want to remove the deferrable work without
any tunable interface. But, I agree that to support both
delayed and deferrable work according to the choice of
devfreq device drivers.
> battery and probably developers were not aware because there was no
> traces which could give any information about devfreq behavior.
> Power is important, performance is important but relaying on randomness
> is not the best solution. As I said earlier, your driver can stuck on
> highest OPP waiting for next callback which will not come...
>
> Regards,
> Lukasz
>>
>>
>>> an idle CPU would be too high. In my opinion it won't and fixing bug is more
>>> important than < 1% more power used [1].
>>> I have addressed also this issue. In this patch set there is a mechanism
>>> which prevents from to frequent checks when there is no need.
>>> When the device enters its lowest state (OPP) the framework sets polling
>>> interval to 'polling_idle_ms'. In default Kconfig it is 500ms, so 2 times per
>>> second.
>>> It is tunable from the sysfs interface per device, thus driver developer can
>>> experiment and choose best intervals for the system.
>>>
>>> Changes:
>>> v3:
>>> - reordered first two patches
>>> - added functionality to lower power consumption when the device is less busy;
>>> there is a new polling interval enabled when device enters lowest frequency;
>>> - added trace events to capture the behaviour of the system
>>> v2:
>>> - single patch split into two
>>> - added cover letter
>>>
>>> link for the previous version and discussion:
>>> https://marc.info/?l=linux-pm&m=154989907416072&w=2
>>> https://marc.info/?l=linux-pm&m=154904631226997&w=2
>>>
>>> Regards,
>>> Lukasz Luba
>>>
>>> [1] https://marc.info/?l=linux-kernel&m=155000641622443&w=2
>>>
>>> Lukasz Luba (7):
>>> drivers: devfreq: change deferred work into delayed
>>> drivers: devfreq: change devfreq workqueue mechanism
>>> Kconfig: drivers: devfreq: add default idle polling
>>> include: devfreq: add polling_idle_ms to 'profile'
>>> drivers: devfreq: add longer polling interval in idle
>>> trace: events: add devfreq trace event file
>>> drivers: devfreq: add tracing for scheduling work
>>>
>>> MAINTAINERS | 1 +
>>> drivers/devfreq/Kconfig | 13 +++
>>> drivers/devfreq/devfreq.c | 184 ++++++++++++++++++++++++------
>>> drivers/devfreq/governor.h | 3 +-
>>> drivers/devfreq/governor_simpleondemand.c | 6 +-
>>> include/linux/devfreq.h | 6 +
>>> include/trace/events/devfreq.h | 39 +++++++
>>> 7 files changed, 218 insertions(+), 34 deletions(-)
>>> create mode 100644 include/trace/events/devfreq.h
>>>
>>
>>
>
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Lukasz,
On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
> This patch changes deferred work to delayed work, which is now not missed
> when timer is put on CPU that entered idle state.
> The devfreq framework governor was not called, thus changing the device's
> frequency did not happen.
> Benchmarks for stressing Dynamic Memory Controller show x2 (in edge cases
> even x5) performance boost with this patch when 'simpleondemand_governor'
> is responsible for monitoring the device load and frequency changes.
>
> With this patch, the delayed work is done no mater CPUs' idle.
> All of the drivers in devfreq which rely on periodic, guaranteed wakeup
> intervals should benefit from it.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/devfreq/devfreq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index 0ae3de7..0c9bff8 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -407,7 +407,7 @@ static void devfreq_monitor(struct work_struct *work)
> */
> void devfreq_monitor_start(struct devfreq *devfreq)
> {
> - INIT_DEFERRABLE_WORK(&devfreq->work, devfreq_monitor);
> + INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
As I commented, I can't agree that just changing the type of work
from deferrable to delayed for the always periodic timer.
Instead, if devfreq framework supports both deferrable
and delayed work according to selecting the kind of work
by user, I agree.
> if (devfreq->profile->polling_ms)
> queue_delayed_work(devfreq_wq, &devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Lukasz,
This patch depends on first patch. Firstly, we need to discuss
the first patch and then I'll review this patch.
[1] [PATCH v3 1/7] drivers: devfreq: change deferred work into delayed
On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
> There is no need for creating another workqueue in the system,
> the existing one should meet the requirements.
> This patch removes devfreq's custom workqueue and uses system one.
> It switches from queue_delayed_work() to schedule_delayed_work().
> It also does not wake up the system when it enters suspend (this
> functionality stays the same).
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/devfreq/devfreq.c | 25 ++++++-------------------
> 1 file changed, 6 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index 0c9bff8..c200b3c 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -31,13 +31,6 @@
>
> static struct class *devfreq_class;
>
> -/*
> - * devfreq core provides delayed work based load monitoring helper
> - * functions. Governors can use these or can implement their own
> - * monitoring mechanism.
> - */
> -static struct workqueue_struct *devfreq_wq;
> -
> /* The list of all device-devfreq governors */
> static LIST_HEAD(devfreq_governor_list);
> /* The list of all device-devfreq */
> @@ -391,8 +384,8 @@ static void devfreq_monitor(struct work_struct *work)
> if (err)
> dev_err(&devfreq->dev, "dvfs failed with (%d) error\n", err);
>
> - queue_delayed_work(devfreq_wq, &devfreq->work,
> - msecs_to_jiffies(devfreq->profile->polling_ms));
> + schedule_delayed_work(&devfreq->work,
> + msecs_to_jiffies(devfreq->profile->polling_ms));
> mutex_unlock(&devfreq->lock);
> }
>
> @@ -409,7 +402,7 @@ void devfreq_monitor_start(struct devfreq *devfreq)
> {
> INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
> if (devfreq->profile->polling_ms)
> - queue_delayed_work(devfreq_wq, &devfreq->work,
> + schedule_delayed_work(&devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
> }
> EXPORT_SYMBOL(devfreq_monitor_start);
> @@ -473,7 +466,7 @@ void devfreq_monitor_resume(struct devfreq *devfreq)
>
> if (!delayed_work_pending(&devfreq->work) &&
> devfreq->profile->polling_ms)
> - queue_delayed_work(devfreq_wq, &devfreq->work,
> + schedule_delayed_work(&devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
>
> devfreq->last_stat_updated = jiffies;
> @@ -516,7 +509,7 @@ void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay)
>
> /* if current delay is zero, start polling with new delay */
> if (!cur_delay) {
> - queue_delayed_work(devfreq_wq, &devfreq->work,
> + schedule_delayed_work(&devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
> goto out;
> }
> @@ -527,7 +520,7 @@ void devfreq_interval_update(struct devfreq *devfreq, unsigned int *delay)
> cancel_delayed_work_sync(&devfreq->work);
> mutex_lock(&devfreq->lock);
> if (!devfreq->stop_polling)
> - queue_delayed_work(devfreq_wq, &devfreq->work,
> + schedule_delayed_work(&devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
> }
> out:
> @@ -1430,12 +1423,6 @@ static int __init devfreq_init(void)
> return PTR_ERR(devfreq_class);
> }
>
> - devfreq_wq = create_freezable_workqueue("devfreq_wq");
> - if (!devfreq_wq) {
> - class_destroy(devfreq_class);
> - pr_err("%s: couldn't create workqueue\n", __FILE__);
> - return -ENOMEM;
> - }
> devfreq_class->dev_groups = devfreq_groups;
>
> return 0;
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Lukasz,
On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
> This patch add basic tracing of the devfreq workqueue and delayed work.
> It aims to capture changes of the polling intervals and device state.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/devfreq/devfreq.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index 29e99ce..c1d0d8c 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -29,6 +29,9 @@
> #include <linux/of.h>
> #include "governor.h"
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/devfreq.h>
> +
> /* The ~30% load threshold used for load calculation (due to fixed point
> * arithmetic) */
> #define LOAD_THRESHOLD_IN_DEVICE_USAGE (300)
> @@ -418,6 +421,7 @@ static void devfreq_monitor(struct work_struct *work)
> struct devfreq *devfreq = container_of(work,
> struct devfreq, work.work);
> unsigned int polling_ms;
> + const char *df_name = dev_name(&devfreq->dev);
nit: You can use 'dev_name(&devfreq->dev)' directly
without defining the separate df_name.
>
> mutex_lock(&devfreq->lock);
> polling_ms = devfreq_get_polling_delay(devfreq);
> @@ -429,6 +433,10 @@ static void devfreq_monitor(struct work_struct *work)
> schedule_delayed_work(&devfreq->work,
> msecs_to_jiffies(polling_ms));
> mutex_unlock(&devfreq->lock);
> +
> + trace_devfreq_monitor(df_name, devfreq->previous_freq, polling_ms,
> + devfreq->last_status.busy_time,
> + devfreq->last_status.total_time);
Regardless of type of work,
I think that trace point is necessary for devfreq framework.
Reviewed-by: Chanwoo Choi <[email protected]>
> }
>
> /**
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Lukasz,
If the user can select the type of work in accordance with
their choice, either deferrable work or delayed work
for periodic without stop on idle state,
I think that the existing polling_ms is enough.
Because, user determine to use the 'delayed work' for periodic timer,
user can change the polling_ms through already provided sysfs interface
according to multiple situation.
In fact,
If the user want to use the periodic timer without stop on idle state
instead of deferrable work, it means that the user always want to
monitor/check the current load/status of device at the correct interval
and then changing the frequency without any latency.
On 19. 2. 13. 오전 7:23, Lukasz Luba wrote:
> Add needed fields to support new state: idle, where different polling
> interval is in use. It provides better control of the devfreq device
> and lower the power consumption when the device is not busy.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/devfreq.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
> index fbffa74..5140970 100644
> --- a/include/linux/devfreq.h
> +++ b/include/linux/devfreq.h
> @@ -72,6 +72,11 @@ struct devfreq_dev_status {
> * @initial_freq: The operating frequency when devfreq_add_device() is
> * called.
> * @polling_ms: The polling interval in ms. 0 disables polling.
> + * @polling_idle_ms: The polling interval in 'idle state' in ms.
> + * When the device is running at lowest frequency and has
> + * low-load, it is considered as in 'idle state'.
> + * Thus, longer polling interval is used for the device
> + * to save some power.
> * @target: The device should set its operating frequency at
> * freq or lowest-upper-than-freq value. If freq is
> * higher than any operable frequency, set maximum.
> @@ -98,6 +103,7 @@ struct devfreq_dev_status {
> struct devfreq_dev_profile {
> unsigned long initial_freq;
> unsigned int polling_ms;
> + unsigned int polling_idle_ms;
>
> int (*target)(struct device *dev, unsigned long *freq, u32 flags);
> int (*get_dev_status)(struct device *dev,
>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
Hi Lukasz,
On 19. 2. 13. 오후 10:35, Lukasz Luba wrote:
> Hi Steven,
>
> On 2/13/19 12:14 AM, Steven Rostedt wrote:
>> On Tue, 12 Feb 2019 23:23:57 +0100
>> Lukasz Luba <[email protected]> wrote:
>>
>>> The patch adds a new file for with trace events for devfreq
>>> framework. They are used for performance analysis of the framework.
>>> It also contains updates in MAINTAINERS file adding new entry for
>>> devfreq maintainers.
>>>
>>> Signed-off-by: Lukasz Luba <[email protected]>
>>> ---
>>> MAINTAINERS | 1 +
>>> include/trace/events/devfreq.h | 39 +++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 40 insertions(+)
>>> create mode 100644 include/trace/events/devfreq.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 9919840..c042fda 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -4447,6 +4447,7 @@ S: Maintained
>>> F: drivers/devfreq/
>>> F: include/linux/devfreq.h
>>> F: Documentation/devicetree/bindings/devfreq/
>>> +F: include/trace/events/devfreq.h
>>>
>>> DEVICE FREQUENCY EVENT (DEVFREQ-EVENT)
>>> M: Chanwoo Choi <[email protected]>
>>> diff --git a/include/trace/events/devfreq.h b/include/trace/events/devfreq.h
>>> new file mode 100644
>>> index 0000000..fec9304
>>> --- /dev/null
>>> +++ b/include/trace/events/devfreq.h
>>> @@ -0,0 +1,39 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#undef TRACE_SYSTEM
>>> +#define TRACE_SYSTEM devfreq
>>> +
>>> +#if !defined(_TRACE_DEVFREQ_H) || defined(TRACE_HEADER_MULTI_READ)
>>> +#define _TRACE_DEVFREQ_H
>>> +
>>> +#include <linux/devfreq.h>
>>> +#include <linux/tracepoint.h>
>>> +
>>> +TRACE_EVENT(devfreq_monitor,
>>> + TP_PROTO(const char *dev_name, unsigned long freq,
>>> + unsigned int polling_ms, unsigned long busy_time,
>>> + unsigned long total_time),
>>> +
>>> + TP_ARGS(dev_name, freq, polling_ms, busy_time, total_time),
>>> +
>>> + TP_STRUCT__entry(
>>> + __string(dev_name, dev_name)
>>> + __field(unsigned long, freq)
>>> + __field(unsigned int, polling_ms)
>>> + __field(unsigned int, load)
>>> + ),
>>> +
>>> + TP_fast_assign(
>>> + __assign_str(dev_name, dev_name);
>>> + __entry->freq = freq;
>>> + __entry->polling_ms = polling_ms;
>>> + __entry->load = (100 * busy_time) / total_time;
>>
>> How critical is the code that this trace event is called in. If it is
>> not that critical (it is a slow path), then this is fine, but if this
>> is not a slow path (something triggered 100 HZ or less), then I would
>> recommend moving the above calculation to TP_printk(). A divide is a
>> slow operation, and is best to do that in the post processing. The
>> current location does the divide at the time of the tracepoint is
>> called.
> I wasn't aware of these two stages, good to know.
> I will move it to TP_printk().
>>
>> I would also have a check to make sure that total_time is not zero
>> here, otherwise that could be bad.
>>
>> __entry->busy_time = busy_time;
>> __entry->total_time = total_time;
>>
>
>>> + ),
>>> +
>>> + TP_printk("dev_name=%s freq=%lu polling_ms=%u load=%u",
>>
>>
>>> + __get_str(dev_name), __entry->freq, __entry->polling_ms,
>>
>> __entry->total_time == 0 ? 100 :
>> __entry->busy_time / __entry->total_time)
> Thank you for the review.
> I will add this check.
>
> Regards,
> Lukasz
>>
>> -- Steve
>>
>>> + __entry->load)
>>> +);
>>> +#endif /* _TRACE_DEVFREQ_H */
>>> +
>>> +/* This part must be outside protection */
>>> +#include <trace/define_trace.h>
>>
>>
>>
>
>
I agree that trace point is necessary for devfreq framework.
Reviewed-by: Chanwoo Choi <[email protected]>
--
Best Regards,
Chanwoo Choi
Samsung Electronics
> This patch adds new mechanism for devfreq devices which changes polling
> interval. The system should sleep longer when the devfreq device is almost
> not used. The devfreq framework will not schedule the work too often.
> This low-load state is recognised when the device is operating at the lowest
> frequency and has low busy_time/total_time factor (< 30%).
> When the frequency is different then min, the device is under normal polling
> which is the value defined in driver's 'polling_ms'.
> When the device is getting more pressure, the framework is able to catch it
> based on 'load' in lowest frequency and will start polling more frequently.
> The second scenario is when the governor recognised heavy load at minimum
> frequency and increases the frequency. The devfreq framework will start
> polling in shorter intervals.
> The polling interval, when the device is not heavily, can also be changed
> from userspace of defined by the driver's author.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/devfreq/devfreq.c | 151 +++++++++++++++++++++++++++---
> drivers/devfreq/governor.h | 3 +-
> drivers/devfreq/governor_simpleondemand.c | 6 +-
> 3 files changed, 145 insertions(+), 15 deletions(-)
>
There are some requirements that you need to consider:
Is 30% really applicable to ALL devfreq devices?
- What if some devices do not want such behaviors?
- What if some devices want different values (change behavors)?
- What if some manufactures want different default values?
- What if some devices want to let the framework know that it's in idle?
- What if some other kernel context, device (drivers),
or userspace process want to notify that it's no more idling?
As mentioned in the internal thread (tizen.org),
I'm not convinced by the idea of assuming that a device can be considered "idling"
if it has simply "low" utilization.
You are going to deteriorate the UI response time of mobile devices significantly.
Cheers,
MyungJoo.
Hi MyungJoo,
Thank you for taking part in the discussion.
Please check my comments below.
On 2/18/19 5:33 AM, MyungJoo Ham wrote:
>> This patch adds new mechanism for devfreq devices which changes polling
>> interval. The system should sleep longer when the devfreq device is almost
>> not used. The devfreq framework will not schedule the work too often.
>> This low-load state is recognised when the device is operating at the lowest
>> frequency and has low busy_time/total_time factor (< 30%).
>> When the frequency is different then min, the device is under normal polling
>> which is the value defined in driver's 'polling_ms'.
>> When the device is getting more pressure, the framework is able to catch it
>> based on 'load' in lowest frequency and will start polling more frequently.
>> The second scenario is when the governor recognised heavy load at minimum
>> frequency and increases the frequency. The devfreq framework will start
>> polling in shorter intervals.
>> The polling interval, when the device is not heavily, can also be changed
>> from userspace of defined by the driver's author.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> drivers/devfreq/devfreq.c | 151 +++++++++++++++++++++++++++---
>> drivers/devfreq/governor.h | 3 +-
>> drivers/devfreq/governor_simpleondemand.c | 6 +-
>> 3 files changed, 145 insertions(+), 15 deletions(-)
>>
>
> There are some requirements that you need to consider:
>
> Is 30% really applicable to ALL devfreq devices?
The 30% load while the device is on lowest OPP is to filter some noise.
It might be tunable over sysfs for each device if you like.
> - What if some devices do not want such behaviors?
They can set polling_idle_ms and polling_ms the same value.
> - What if some devices want different values (change behavors)?
Need of sysfs tunable here.
> - What if some manufactures want different default values?
Like above (sysfs).
> - What if some devices want to let the framework know that it's in idle?
There might be a filed in devfreq->state which could handle this.
> - What if some other kernel context, device (drivers),
> or userspace process want to notify that it's no more idling?This issue is more related to the new movement in the 'interconnect'
development. They have a goal for this kind of interactions and QoS
between devices or their clients. In devfreq it would be possible
to tackle this, but would require a lot of changes (notification chain,
state machines in devices,
>
> As mentioned in the internal thread (tizen.org),
> I'm not convinced by the idea of assuming that a device can be considered "idling"
> if it has simply "low" utilization.
>
> You are going to deteriorate the UI response time of mobile devices significantly.
Current devfreq wake-up also does not guarantee that, maybe on a single
CPU platform does.
I will try to address your and Chanwoo's comments that the devfreq still
needs deferred polling in some platforms.
Would it be OK if we have two options: deferred and delayed work while
registering a wakeup for a device?
That would be a function like: polling_mode_init(devfreq) instead of
simple INIT_DEFERRED_WORK(), which will check the device's preference.
The device driver could set a filed in 'polling_mode' to enum:
POWER_EFFICIENT or RELIABLE_INTERVAL. For compatibility with old drivers
where the polling_mode = 0, SYSTEM_DEFAULT_POLLING_MODE (which is one
of these two) would be used.
Then the two-phase-polling-interval from this patch could only be used
for the RELIABLE_INTERVAL configuration or even dropped.
Regards,
Lukasz
>
> Cheers,
> MyungJoo.
>
>
>>
>> There are some requirements that you need to consider:
>>
>> Is 30% really applicable to ALL devfreq devices?
>The 30% load while the device is on lowest OPP is to filter some noise.
>It might be tunable over sysfs for each device if you like.
>> - What if some devices do not want such behaviors?
>They can set polling_idle_ms and polling_ms the same value.
>> - What if some devices want different values (change behavors)?
>Need of sysfs tunable here.
>> - What if some manufactures want different default values?
>Like above (sysfs).
>> - What if some devices want to let the framework know that it's in idle?
>There might be a filed in devfreq->state which could handle this.
>> - What if some other kernel context, device (drivers),
>> or userspace process want to notify that it's no more idling?This issue is more related to the new movement in the 'interconnect'
>development. They have a goal for this kind of interactions and QoS
>between devices or their clients. In devfreq it would be possible
>to tackle this, but would require a lot of changes (notification chain,
>state machines in devices,
>
>>
>> As mentioned in the internal thread (tizen.org),
>> I'm not convinced by the idea of assuming that a device can be considered "idling"
>> if it has simply "low" utilization.
>>
>> You are going to deteriorate the UI response time of mobile devices significantly.
>Current devfreq wake-up also does not guarantee that, maybe on a single
>CPU platform does.
Yes, the current devfreq does not enhance UI response time in the sense
that it would keep the same reponse time anyway.
However, your current approach will surely deteriorate it by lengthen
the polling latency.
For mobile and wearable devices, in many cases I've been witnessing,
the device idles right before the user's input (launching an app,
scrolling messages or web pages, or press a play button).
For mitigation, we often relay UI inputs to DVFS mechanisms so that
we either increase frequency for any UI inputs for a short period or
shorten the polling latency for a short period.
When a highly user-interactive device is idling or operating a low frequency,
we should assume that it's going to be highly performing anytime;
loosening the checking period is not a good solution in that sense
although it is probable for servers or workstations.
But, I don't think servers/workstations do care power consumption of
DVFS checking loops anyway.
>
>I will try to address your and Chanwoo's comments that the devfreq still
>needs deferred polling in some platforms.
>Would it be OK if we have two options: deferred and delayed work while
>registering a wakeup for a device?
>That would be a function like: polling_mode_init(devfreq) instead of
>simple INIT_DEFERRED_WORK(), which will check the device's preference.
>The device driver could set a filed in 'polling_mode' to enum:
>POWER_EFFICIENT or RELIABLE_INTERVAL. For compatibility with old drivers
>where the polling_mode = 0, SYSTEM_DEFAULT_POLLING_MODE (which is one
>of these two) would be used.
>Then the two-phase-polling-interval from this patch could only be used
>for the RELIABLE_INTERVAL configuration or even dropped.
One thing I want to add is: let's not over-complicate it.
Do you have any experimental results on how much power is saved by doing this?
(and user response time losses)
Cheers,
MyungJoo
Hi MyungJoo,
On 2/21/19 6:56 AM, MyungJoo Ham wrote:
>>>
>>> There are some requirements that you need to consider:
>>>
>>> Is 30% really applicable to ALL devfreq devices?
>> The 30% load while the device is on lowest OPP is to filter some noise.
>> It might be tunable over sysfs for each device if you like.
>>> - What if some devices do not want such behaviors?
>> They can set polling_idle_ms and polling_ms the same value.
>>> - What if some devices want different values (change behavors)?
>> Need of sysfs tunable here.
>>> - What if some manufactures want different default values?
>> Like above (sysfs).
>>> - What if some devices want to let the framework know that it's in idle?
>> There might be a filed in devfreq->state which could handle this.
>>> - What if some other kernel context, device (drivers),
>>> or userspace process want to notify that it's no more idling?This issue is more related to the new movement in the 'interconnect'
>> development. They have a goal for this kind of interactions and QoS
>> between devices or their clients. In devfreq it would be possible
>> to tackle this, but would require a lot of changes (notification chain,
>> state machines in devices,
>>
>>>
>>> As mentioned in the internal thread (tizen.org),
>>> I'm not convinced by the idea of assuming that a device can be considered "idling"
>>> if it has simply "low" utilization.
>>>
>>> You are going to deteriorate the UI response time of mobile devices significantly.
>> Current devfreq wake-up also does not guarantee that, maybe on a single
>> CPU platform does.
>
> Yes, the current devfreq does not enhance UI response time in the sense
> that it would keep the same reponse time anyway.
>
> However, your current approach will surely deteriorate it by lengthen
> the polling latency.
>
> For mobile and wearable devices, in many cases I've been witnessing,
> the device idles right before the user's input (launching an app,
> scrolling messages or web pages, or press a play button).
> For mitigation, we often relay UI inputs to DVFS mechanisms so that
> we either increase frequency for any UI inputs for a short period or
> shorten the polling latency for a short period.
> When a highly user-interactive device is idling or operating a low frequency,
> we should assume that it's going to be highly performing anytime;
> loosening the checking period is not a good solution in that sense
> although it is probable for servers or workstations.
> But, I don't think servers/workstations do care power consumption of
> DVFS checking loops anyway.
It is not just mobiles or servers (I agree servers are not in scope).
Devices like TVs, set-top-boxes, IP cameras, car IVI, they might care
about DVFS, but they do not have batteries (maybe a big car battery...).
They probably have more than 1 CPU and have some accelerators, which
they might register devfreq devices.
>
>
>
>>
>> I will try to address your and Chanwoo's comments that the devfreq still
>> needs deferred polling in some platforms.
>> Would it be OK if we have two options: deferred and delayed work while
>> registering a wakeup for a device?
>> That would be a function like: polling_mode_init(devfreq) instead of
>> simple INIT_DEFERRED_WORK(), which will check the device's preference.
>> The device driver could set a filed in 'polling_mode' to enum:
>> POWER_EFFICIENT or RELIABLE_INTERVAL. For compatibility with old drivers
>> where the polling_mode = 0, SYSTEM_DEFAULT_POLLING_MODE (which is one
>> of these two) would be used.
>> Then the two-phase-polling-interval from this patch could only be used
>> for the RELIABLE_INTERVAL configuration or even dropped.
>
> One thing I want to add is: let's not over-complicate it.
> Do you have any experimental results on how much power is saved by doing this?
> (and user response time losses)
No I don't have such results. Maybe it is not even worth to add this
two-phase-polling because the polling using DELAYED does not increase
the power consumption too much. Like for the device mentioned above.
Simplest first step.
What would you say to add just the possibility of choosing between:
INIT_DEFERRED_WORK() and INIT_DELAYED_WORK() in runtime via sysfs?
There would be also Kconfig which sets default.
No additional code for different custom polling modes.
Regards,
Lukasz
>
>
> Cheers,
> MyungJoo
>
>
>