2019-10-21 14:37:12

by Ram Prakash Gupta

Subject: [RFC 0/6] mmc: Add clock scaling support for mmc driver

This change adds the use of devfreq-based clock scaling to MMC.
It is applicable to both eMMC and SD cards.
For some workloads, such as video playback, these cards do not
need to run at high speed. In such cases, running at a lower
frequency can still meet the deadlines for data transfers.

Dynamically scaling down the clock frequency saves power, not only
because the bus runs at a lower frequency but also because the
system core voltage can be scaled down along with it, if supported.
Provide ondemand clock scaling support similar to the cpufreq
ondemand governor, with two thresholds, up_threshold and
down_threshold, that decide whether to scale the frequency up or
down based on load.
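
As a rough illustration of the policy (a minimal sketch only; the
function and parameter names below are hypothetical and not part of
this series), the governor compares the measured bus load against the
two thresholds and picks either the highest or the lowest entry of
the frequency table:

static unsigned long pick_target_freq(unsigned int load_pct,
				      unsigned int up_threshold,
				      unsigned int down_threshold,
				      unsigned long curr_freq,
				      const unsigned long *freq_table,
				      int freq_table_sz)
{
	/* heavy load: jump to the highest supported frequency */
	if (load_pct > up_threshold)
		return freq_table[freq_table_sz - 1];

	/* light load: drop to the lowest supported frequency */
	if (load_pct < down_threshold)
		return freq_table[0];

	/* otherwise keep the current frequency */
	return curr_freq;
}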


Ram Prakash Gupta (6):
mmc: core: Parse clk scaling dt entries
mmc: core: Add core scaling support in driver
mmc: core: Initialize clk scaling for mmc and SDCard
mmc: core: Add debugfs entries for scaling support
mmc: sdhci-msm: Add capability in platform host
dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters

.../devicetree/bindings/mmc/sdhci-msm.txt | 19 +
drivers/mmc/core/block.c | 19 +-
drivers/mmc/core/core.c | 777 +++++++++++++++++++++
drivers/mmc/core/core.h | 17 +
drivers/mmc/core/debugfs.c | 90 +++
drivers/mmc/core/host.c | 226 ++++++
drivers/mmc/core/mmc.c | 246 ++++++-
drivers/mmc/core/queue.c | 2 +
drivers/mmc/core/sd.c | 84 ++-
drivers/mmc/host/sdhci-msm.c | 2 +
include/linux/mmc/card.h | 7 +
include/linux/mmc/host.h | 66 ++
12 files changed, 1550 insertions(+), 5 deletions(-)

--
1.9.1


2019-10-21 14:37:25

by Ram Prakash Gupta

Subject: [RFC 1/6] mmc: core: Parse clk scaling dt entries

Add support to parse clock scaling device tree entries in the
mmc driver.

Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Talel Shenhar <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
Signed-off-by: Veerabhadrarao Badiganti <[email protected]>
Signed-off-by: Bao D. Nguyen <[email protected]>
Signed-off-by: Can Guo <[email protected]>
Signed-off-by: Sayali Lokhande <[email protected]>
Co-Developed-by: Ram Prakash Gupta <[email protected]>
Signed-off-by: Ram Prakash Gupta <[email protected]>
---
drivers/mmc/core/host.c | 63 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/mmc/card.h | 7 +++++
include/linux/mmc/host.h | 66 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 136 insertions(+)

diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 105b7a7..672f2d6 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -161,6 +161,44 @@ static void mmc_retune_timer(struct timer_list *t)
mmc_retune_needed(host);
}

+static int mmc_dt_get_array(struct device *dev, const char *prop_name,
+ u32 **out, int *len, u32 size)
+{
+ int ret = 0;
+ struct device_node *np = dev->of_node;
+ size_t sz;
+ u32 *arr = NULL;
+
+ if (!of_get_property(np, prop_name, len)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ sz = *len = *len / sizeof(*arr);
+ if (sz <= 0 || (size > 0 && (sz > size))) {
+ dev_err(dev, "%s invalid size\n", prop_name);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ arr = devm_kcalloc(dev, sz, sizeof(*arr), GFP_KERNEL);
+ if (!arr) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = of_property_read_u32_array(np, prop_name, arr, sz);
+ if (ret < 0) {
+ dev_err(dev, "%s failed reading array %d\n", prop_name, ret);
+ goto out;
+ }
+ *out = arr;
+out:
+ if (ret)
+ *len = 0;
+ return ret;
+
+}
+
/**
* mmc_of_parse() - parse host's device-tree node
* @host: host whose node should be parsed.
@@ -173,10 +211,12 @@ static void mmc_retune_timer(struct timer_list *t)
int mmc_of_parse(struct mmc_host *host)
{
struct device *dev = host->parent;
+ struct device_node *np = dev->of_node;
u32 bus_width, drv_type, cd_debounce_delay_ms;
int ret;
bool cd_cap_invert, cd_gpio_invert = false;
bool ro_cap_invert, ro_gpio_invert = false;
+ const char *lower_bus_speed = NULL;

if (!dev || !dev_fwnode(dev))
return 0;
@@ -340,6 +380,29 @@ int mmc_of_parse(struct mmc_host *host)
device_property_read_u32(dev, "post-power-on-delay-ms",
&host->ios.power_delay_ms);

+ if (mmc_dt_get_array(dev, "devfreq,freq-table",
+ &host->clk_scaling.pltfm_freq_table,
+ &host->clk_scaling.pltfm_freq_table_sz, 0))
+ pr_debug("%s: no clock scaling frequencies were supplied\n",
+ dev_name(dev));
+ else if (!host->clk_scaling.pltfm_freq_table ||
+ !host->clk_scaling.pltfm_freq_table_sz)
+ dev_info(dev, "bad dts clock scaling frequencies\n");
+
+ /*
+ * A few hosts can support DDR52 mode at the same lower
+ * system voltage corner as high-speed mode. In such
+ * cases, it is better to use DDR52 mode, which improves
+ * performance without any power impact.
+ */
+ if (!of_property_read_string(np, "scaling-lower-bus-speed-mode",
+ &lower_bus_speed)) {
+ if (!strncmp(lower_bus_speed, "DDR52", strlen(lower_bus_speed)))
+ host->clk_scaling.lower_bus_speed_mode |=
+ MMC_SCALING_LOWER_DDR52_MODE;
+ }
+
return mmc_pwrseq_alloc(host);
}

diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 9b6336a..8b577a1 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -244,6 +244,12 @@ struct mmc_card {
struct mmc_host *host; /* the host this device belongs to */
struct device dev; /* the device */
u32 ocr; /* the current OCR setting */
+ unsigned long clk_scaling_lowest; /* lowest scaleable
+ * frequency
+ */
+ unsigned long clk_scaling_highest; /* highest scaleable
+ * frequency
+ */
unsigned int rca; /* relative card address of device */
unsigned int type; /* card type */
#define MMC_TYPE_MMC 0 /* MMC card */
@@ -306,6 +312,7 @@ struct mmc_card {
struct dentry *debugfs_root;
struct mmc_part part[MMC_NUM_PHY_PARTITION]; /* physical partitions */
unsigned int nr_parts;
+ unsigned int part_curr;

unsigned int bouncesz; /* Bounce buffer size */
struct workqueue_struct *complete_wq; /* Private workqueue */
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index ba70338..bfb47d6 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -9,6 +9,7 @@

#include <linux/sched.h>
#include <linux/device.h>
+#include <linux/devfreq.h>
#include <linux/fault-inject.h>

#include <linux/mmc/core.h>
@@ -268,9 +269,72 @@ struct mmc_ctx {
struct task_struct *task;
};

+enum dev_state {
+ DEV_SUSPENDING = 1,
+ DEV_SUSPENDED,
+ DEV_RESUMED,
+};
+
+enum mmc_load {
+ MMC_LOAD_HIGH,
+ MMC_LOAD_LOW,
+};
+
+/**
+ * struct mmc_devfeq_clk_scaling - main context for MMC clock scaling logic
+ *
+ * @lock: spinlock to protect statistics
+ * @devfreq: struct that represent mmc-host as a client for devfreq
+ * @devfreq_profile: MMC device profile, mostly polling interval and callbacks
+ * @ondemand_gov_data: struct supplied to ondemand governor (thresholds)
+ * @state: load state, can be HIGH or LOW. used to notify mmc_host_ops callback
+ * @start_busy: timestamp armed once a data request is started
+ * @measure_interval_start: timestamp armed once a measure interval is started
+ * @devfreq_abort: flag to sync between different contexts relevant to devfreq
+ * @skip_clk_scale_freq_update: flag that enables/disables frequency changes
+ * @freq_table_sz: table size of frequencies supplied to devfreq
+ * @freq_table: frequencies table supplied to devfreq
+ * @curr_freq: current frequency
+ * @polling_delay_ms: polling interval for status collection used by devfreq
+ * @upthreshold: up-threshold supplied to ondemand governor
+ * @downthreshold: down-threshold supplied to ondemand governor
+ * @need_freq_change: flag indicating if a frequency change is required
+ * @is_busy_started: flag indicating if a request is handled by the HW
+ * @enable: flag indicating if the clock scaling logic is enabled for this host
+ * @is_suspended: flag to queue devfreq requests while mmc is suspended
+ */
+struct mmc_devfeq_clk_scaling {
+ spinlock_t lock;
+ struct devfreq *devfreq;
+ struct devfreq_dev_profile devfreq_profile;
+ struct devfreq_simple_ondemand_data ondemand_gov_data;
+ enum mmc_load state;
+ ktime_t start_busy;
+ ktime_t measure_interval_start;
+ atomic_t devfreq_abort;
+ bool skip_clk_scale_freq_update;
+ int freq_table_sz;
+ int pltfm_freq_table_sz;
+ u32 *freq_table;
+ u32 *pltfm_freq_table;
+ unsigned long total_busy_time_us;
+ unsigned long target_freq;
+ unsigned long curr_freq;
+ unsigned long polling_delay_ms;
+ unsigned int upthreshold;
+ unsigned int downthreshold;
+ unsigned int lower_bus_speed_mode;
+#define MMC_SCALING_LOWER_DDR52_MODE 1
+ bool need_freq_change;
+ bool is_busy_started;
+ bool enable;
+ bool is_suspended;
+};
+
struct mmc_host {
struct device *parent;
struct device class_dev;
+ struct mmc_devfeq_clk_scaling clk_scaling;
int index;
const struct mmc_host_ops *ops;
struct mmc_pwrseq *pwrseq;
@@ -369,6 +433,7 @@ struct mmc_host {
#define MMC_CAP2_CQE_DCMD (1 << 24) /* CQE can issue a direct command */
#define MMC_CAP2_AVOID_3_3V (1 << 25) /* Host must negotiate down from 3.3V */
#define MMC_CAP2_MERGE_CAPABLE (1 << 26) /* Host can merge a segment over the segment size */
+#define MMC_CAP2_CLK_SCALE (1 << 27) /* Allow dynamic clk scaling */

int fixed_drv_type; /* fixed driver type for non-removable media */

@@ -462,6 +527,7 @@ struct mmc_host {
bool cqe_enabled;
bool cqe_on;

+ atomic_t active_reqs;
unsigned long private[0] ____cacheline_aligned;
};

--
1.9.1

2019-10-21 14:37:36

by Ram Prakash Gupta

Subject: [RFC 3/6] mmc: core: Initialize clk scaling for mmc and SDCard

Initialize clock scaling for the eMMC and SD card drivers.
For eMMC, scaling is initialized for both CQE and legacy modes.

Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Talel Shenhar <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
Signed-off-by: Veerabhadrarao Badiganti <[email protected]>
Signed-off-by: Bao D. Nguyen <[email protected]>
Signed-off-by: Can Guo <[email protected]>
Signed-off-by: Sayali Lokhande <[email protected]>
Co-Developed-by: Ram Prakash Gupta <[email protected]>
Signed-off-by: Ram Prakash Gupta <[email protected]>
---
drivers/mmc/core/block.c | 19 ++++++++++++++++-
drivers/mmc/core/core.c | 8 +++++++
drivers/mmc/core/host.c | 10 +++++++++
drivers/mmc/core/mmc.c | 54 +++++++++++++++++++++++++++++++++++++++++++++---
drivers/mmc/core/queue.c | 2 ++
drivers/mmc/core/sd.c | 32 ++++++++++++++++++++++++++--
6 files changed, 119 insertions(+), 6 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 2c71a43..ce6a754 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -877,6 +877,7 @@ static inline int mmc_blk_part_switch(struct mmc_card *card,
}

card->ext_csd.part_config = part_config;
+ card->part_curr = part_type;

ret = mmc_blk_part_switch_post(card, main_md->part_curr);
}
@@ -1457,6 +1458,7 @@ static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
unsigned long flags;
bool put_card;
int err;
+ bool is_dcmd = false;

mmc_cqe_post_req(host, mrq);

@@ -1484,16 +1486,20 @@ static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
spin_lock_irqsave(&mq->lock, flags);

mq->in_flight[mmc_issue_type(mq, req)] -= 1;
+ atomic_dec(&host->active_reqs);

put_card = (mmc_tot_in_flight(mq) == 0);

mmc_cqe_check_busy(mq);

+ is_dcmd = (mmc_issue_type(mq, req) == MMC_ISSUE_DCMD);
spin_unlock_irqrestore(&mq->lock, flags);

if (!mq->cqe_busy)
blk_mq_run_hw_queues(q, true);

+ mmc_cqe_clk_scaling_stop_busy(host, true, is_dcmd);
+
if (put_card)
mmc_put_card(mq->card, &mq->ctx);
}
@@ -1572,10 +1578,19 @@ static int mmc_blk_cqe_issue_flush(struct mmc_queue *mq, struct request *req)
static int mmc_blk_cqe_issue_rw_rq(struct mmc_queue *mq, struct request *req)
{
struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+ int err = 0;

mmc_blk_data_prep(mq, mqrq, 0, NULL, NULL);

- return mmc_blk_cqe_start_req(mq->card->host, &mqrq->brq.mrq);
+ mmc_deferred_scaling(mq->card->host);
+ mmc_cqe_clk_scaling_start_busy(mq, mq->card->host, true);
+
+ err = mmc_blk_cqe_start_req(mq->card->host, &mqrq->brq.mrq);
+
+ if (err)
+ mmc_cqe_clk_scaling_stop_busy(mq->card->host, true, false);
+
+ return err;
}

static void mmc_blk_rw_rq_prep(struct mmc_queue_req *mqrq,
@@ -1986,12 +2001,14 @@ static void mmc_blk_mq_poll_completion(struct mmc_queue *mq,

static void mmc_blk_mq_dec_in_flight(struct mmc_queue *mq, struct request *req)
{
+ struct mmc_host *host = mq->card->host;
unsigned long flags;
bool put_card;

spin_lock_irqsave(&mq->lock, flags);

mq->in_flight[mmc_issue_type(mq, req)] -= 1;
+ atomic_dec(&host->active_reqs);

put_card = (mmc_tot_in_flight(mq) == 0);

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index f0c233c..6e99c30 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -909,6 +909,9 @@ void mmc_request_done(struct mmc_host *host, struct mmc_request *mrq)
struct mmc_command *cmd = mrq->cmd;
int err = cmd->error;

+ if (host->clk_scaling.is_busy_started)
+ mmc_clk_scaling_stop_busy(host, true);
+
/* Flag re-tuning needed on CRC errors */
if (cmd->opcode != MMC_SEND_TUNING_BLOCK &&
cmd->opcode != MMC_SEND_TUNING_BLOCK_HS200 &&
@@ -1120,6 +1123,11 @@ int mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
if (err)
return err;

+ if (mmc_is_data_request(mrq)) {
+ mmc_deferred_scaling(host);
+ mmc_clk_scaling_start_busy(host, true);
+ }
+
led_trigger_event(host->led, LED_FULL);
__mmc_start_request(host, mrq);

diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 672f2d6..9138041 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -31,6 +31,10 @@

#define cls_dev_to_mmc_host(d) container_of(d, struct mmc_host, class_dev)

+#define MMC_DEVFRQ_DEFAULT_UP_THRESHOLD 35
+#define MMC_DEVFRQ_DEFAULT_DOWN_THRESHOLD 5
+#define MMC_DEVFRQ_DEFAULT_POLLING_MSEC 100
+
static DEFINE_IDA(mmc_host_ida);

static void mmc_host_classdev_release(struct device *dev)
@@ -493,6 +497,7 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
}

spin_lock_init(&host->lock);
+ atomic_set(&host->active_reqs, 0);
init_waitqueue_head(&host->wq);
INIT_DELAYED_WORK(&host->detect, mmc_rescan);
INIT_DELAYED_WORK(&host->sdio_irq_work, sdio_irq_work);
@@ -537,6 +542,11 @@ int mmc_add_host(struct mmc_host *host)
return err;

led_trigger_register_simple(dev_name(&host->class_dev), &host->led);
+ host->clk_scaling.upthreshold = MMC_DEVFRQ_DEFAULT_UP_THRESHOLD;
+ host->clk_scaling.downthreshold = MMC_DEVFRQ_DEFAULT_DOWN_THRESHOLD;
+ host->clk_scaling.polling_delay_ms = MMC_DEVFRQ_DEFAULT_POLLING_MSEC;
+ host->clk_scaling.skip_clk_scale_freq_update = false;
+

#ifdef CONFIG_DEBUG_FS
mmc_add_host_debugfs(host);
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index c9fccfc..bd8af5c 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1967,6 +1967,16 @@ static int mmc_init_card(struct mmc_host *host, u32 ocr,
}
}

+ card->clk_scaling_lowest = host->f_min;
+ if ((card->mmc_avail_type & EXT_CSD_CARD_TYPE_HS400) ||
+ (card->mmc_avail_type & EXT_CSD_CARD_TYPE_HS200))
+ card->clk_scaling_highest = card->ext_csd.hs200_max_dtr;
+ else if ((card->mmc_avail_type & EXT_CSD_CARD_TYPE_HS) ||
+ (card->mmc_avail_type & EXT_CSD_CARD_TYPE_DDR_52))
+ card->clk_scaling_highest = card->ext_csd.hs_max_dtr;
+ else
+ card->clk_scaling_highest = card->csd.max_dtr;
+
/*
* Choose the power class with selected bus interface
*/
@@ -2164,6 +2174,7 @@ static int mmc_poweroff_notify(struct mmc_card *card, unsigned int notify_type)
*/
static void mmc_remove(struct mmc_host *host)
{
+ mmc_exit_clk_scaling(host);
mmc_remove_card(host->card);
host->card = NULL;
}
@@ -2207,6 +2218,12 @@ static int _mmc_suspend(struct mmc_host *host, bool is_suspend)
int err = 0;
unsigned int notify_type = is_suspend ? EXT_CSD_POWER_OFF_SHORT :
EXT_CSD_POWER_OFF_LONG;
+ err = mmc_suspend_clk_scaling(host);
+ if (err) {
+ pr_err("%s: %s: fail to suspend clock scaling (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }

mmc_claim_host(host);

@@ -2231,6 +2248,9 @@ static int _mmc_suspend(struct mmc_host *host, bool is_suspend)
}
out:
mmc_release_host(host);
+ if (err)
+ mmc_resume_clk_scaling(host);
+
return err;
}

@@ -2260,15 +2280,21 @@ static int _mmc_resume(struct mmc_host *host)

mmc_claim_host(host);

- if (!mmc_card_suspended(host->card))
+ if (!mmc_card_suspended(host->card)) {
+ mmc_release_host(host);
goto out;
+ }

mmc_power_up(host, host->card->ocr);
err = mmc_init_card(host, host->card->ocr, host->card);
mmc_card_clr_suspended(host->card);

-out:
mmc_release_host(host);
+ err = mmc_resume_clk_scaling(host);
+ if (err)
+ pr_err("%s: %s: fail to resume clock scaling (%d)\n",
+ mmc_hostname(host), __func__, err);
+out:
return err;
}

@@ -2286,6 +2312,12 @@ static int mmc_shutdown(struct mmc_host *host)
if (mmc_can_poweroff_notify(host->card) &&
!(host->caps2 & MMC_CAP2_FULL_PWR_CYCLE))
err = _mmc_resume(host);
+ /*
+ * Exit clock scaling so that it doesn't kick in after
+ * power off notification is sent
+ */
+ if (host->caps2 & MMC_CAP2_CLK_SCALE)
+ mmc_exit_clk_scaling(host);

if (!err)
err = _mmc_suspend(host, false);
@@ -2348,6 +2380,7 @@ static int mmc_can_reset(struct mmc_card *card)
static int _mmc_hw_reset(struct mmc_host *host)
{
struct mmc_card *card = host->card;
+ int ret;

/*
* In the case of recovery, we can't expect flushing the cache to work
@@ -2367,7 +2400,15 @@ static int _mmc_hw_reset(struct mmc_host *host)
mmc_power_cycle(host, card->ocr);
mmc_pwrseq_reset(host);
}
- return mmc_init_card(host, card->ocr, card);
+
+ ret = mmc_init_card(host, host->card->ocr, host->card);
+ if (ret) {
+ pr_err("%s: %s: mmc_init_card failed (%d)\n",
+ mmc_hostname(host), __func__, ret);
+ return ret;
+ }
+
+ return ret;
}

static const struct mmc_bus_ops mmc_ops = {
@@ -2380,6 +2421,7 @@ static int _mmc_hw_reset(struct mmc_host *host)
.alive = mmc_alive,
.shutdown = mmc_shutdown,
.hw_reset = _mmc_hw_reset,
+ .change_bus_speed = mmc_change_bus_speed,
};

/*
@@ -2436,6 +2478,12 @@ int mmc_attach_mmc(struct mmc_host *host)
goto remove_card;

mmc_claim_host(host);
+ err = mmc_init_clk_scaling(host);
+ if (err) {
+ mmc_release_host(host);
+ goto remove_card;
+ }
+
return 0;

remove_card:
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 9edc086..da22510 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -296,6 +296,7 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
mq->busy = true;

mq->in_flight[issue_type] += 1;
+ atomic_inc(&host->active_reqs);
get_card = (mmc_tot_in_flight(mq) == 1);
cqe_retune_ok = (mmc_cqe_qcnt(mq) == 1);

@@ -335,6 +336,7 @@ static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,

spin_lock_irq(&mq->lock);
mq->in_flight[issue_type] -= 1;
+ atomic_dec(&host->active_reqs);
if (mmc_tot_in_flight(mq) == 0)
put_card = true;
mq->busy = false;
diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index fc78c09..54b8d92 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -645,7 +645,6 @@ static int mmc_sd_change_bus_speed(struct mmc_host *host, unsigned long *freq)
return err;
}

-
/*
* UHS-I specific initialization procedure
*/
@@ -966,7 +965,10 @@ unsigned mmc_sd_get_max_clock(struct mmc_card *card)
{
unsigned max_dtr = (unsigned int)-1;

- if (mmc_card_hs(card)) {
+ if (mmc_card_uhs(card)) {
+ if (max_dtr > card->sw_caps.uhs_max_dtr)
+ max_dtr = card->sw_caps.uhs_max_dtr;
+ } else if (mmc_card_hs(card)) {
if (max_dtr > card->sw_caps.hs_max_dtr)
max_dtr = card->sw_caps.hs_max_dtr;
} else if (max_dtr > card->csd.max_dtr) {
@@ -1143,6 +1145,9 @@ static int mmc_sd_init_card(struct mmc_host *host, u32 ocr,
err = -EINVAL;
goto free_card;
}
+
+ card->clk_scaling_highest = mmc_sd_get_max_clock(card);
+ card->clk_scaling_lowest = host->f_min;
done:
host->card = card;
return 0;
@@ -1159,6 +1164,7 @@ static int mmc_sd_init_card(struct mmc_host *host, u32 ocr,
*/
static void mmc_sd_remove(struct mmc_host *host)
{
+ mmc_exit_clk_scaling(host);
mmc_remove_card(host->card);
host->card = NULL;
}
@@ -1201,6 +1207,13 @@ static int _mmc_sd_suspend(struct mmc_host *host)
{
int err = 0;

+ err = mmc_suspend_clk_scaling(host);
+ if (err) {
+ pr_err("%s: %s: fail to suspend clock scaling (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
mmc_claim_host(host);

if (mmc_card_suspended(host->card))
@@ -1252,6 +1265,13 @@ static int _mmc_sd_resume(struct mmc_host *host)
err = mmc_sd_init_card(host, host->card->ocr, host->card);
mmc_card_clr_suspended(host->card);

+ err = mmc_resume_clk_scaling(host);
+ if (err) {
+ pr_err("%s: %s: fail to resume clock scaling (%d)\n",
+ mmc_hostname(host), __func__, err);
+ goto out;
+ }
+
out:
mmc_release_host(host);
return err;
@@ -1315,6 +1335,7 @@ static int mmc_sd_hw_reset(struct mmc_host *host)
.alive = mmc_sd_alive,
.shutdown = mmc_sd_suspend,
.hw_reset = mmc_sd_hw_reset,
+ .change_bus_speed = mmc_sd_change_bus_speed,
};

/*
@@ -1375,6 +1396,13 @@ int mmc_attach_sd(struct mmc_host *host)
goto remove_card;

mmc_claim_host(host);
+
+ err = mmc_init_clk_scaling(host);
+ if (err) {
+ mmc_release_host(host);
+ goto remove_card;
+ }
+
return 0;

remove_card:
--
1.9.1

2019-10-21 14:37:43

by Ram Prakash Gupta

Subject: [RFC 2/6] mmc: core: Add core scaling support in driver

Add clock scaling support in the mmc core driver.

Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Talel Shenhar <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
Signed-off-by: Veerabhadrarao Badiganti <[email protected]>
Signed-off-by: Bao D. Nguyen <[email protected]>
Signed-off-by: Can Guo <[email protected]>
Signed-off-by: Sayali Lokhande <[email protected]>
Co-Developed-by: Ram Prakash Gupta <[email protected]>
Signed-off-by: Ram Prakash Gupta <[email protected]>
---
drivers/mmc/core/core.c | 769 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/mmc/core/core.h | 17 ++
drivers/mmc/core/mmc.c | 192 ++++++++++++
drivers/mmc/core/sd.c | 54 ++++
4 files changed, 1032 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 2211273..f0c233c 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/completion.h>
+#include <linux/devfreq.h>
#include <linux/device.h>
#include <linux/delay.h>
#include <linux/pagemap.h>
@@ -39,6 +40,7 @@
#include "card.h"
#include "bus.h"
#include "host.h"
+#include "queue.h"
#include "sdio_bus.h"
#include "pwrseq.h"

@@ -109,6 +111,773 @@ static inline void mmc_should_fail_request(struct mmc_host *host,

#endif /* CONFIG_FAIL_MMC_REQUEST */

+static bool mmc_is_data_request(struct mmc_request *mmc_request)
+{
+ switch (mmc_request->cmd->opcode) {
+ case MMC_READ_SINGLE_BLOCK:
+ case MMC_READ_MULTIPLE_BLOCK:
+ case MMC_WRITE_BLOCK:
+ case MMC_WRITE_MULTIPLE_BLOCK:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void mmc_clk_scaling_start_busy(struct mmc_host *host, bool lock_needed)
+{
+ struct mmc_devfeq_clk_scaling *clk_scaling = &host->clk_scaling;
+
+ if (!clk_scaling->enable)
+ return;
+
+ if (lock_needed)
+ spin_lock_bh(&clk_scaling->lock);
+
+ clk_scaling->start_busy = ktime_get();
+ clk_scaling->is_busy_started = true;
+
+ if (lock_needed)
+ spin_unlock_bh(&clk_scaling->lock);
+}
+
+static void mmc_clk_scaling_stop_busy(struct mmc_host *host, bool lock_needed)
+{
+ struct mmc_devfeq_clk_scaling *clk_scaling = &host->clk_scaling;
+
+ if (!clk_scaling->enable)
+ return;
+
+ if (lock_needed)
+ spin_lock_bh(&clk_scaling->lock);
+
+ if (!clk_scaling->is_busy_started) {
+ WARN_ON(1);
+ goto out;
+ }
+
+ clk_scaling->total_busy_time_us +=
+ ktime_to_us(ktime_sub(ktime_get(),
+ clk_scaling->start_busy));
+ pr_debug("%s: accumulated busy time is %lu usec\n",
+ mmc_hostname(host), clk_scaling->total_busy_time_us);
+ clk_scaling->is_busy_started = false;
+
+out:
+ if (lock_needed)
+ spin_unlock_bh(&clk_scaling->lock);
+}
+
+/**
+ * mmc_cqe_clk_scaling_start_busy() - start busy timer for data requests
+ * @mq: pointer to mmc queue structure
+ * @host: pointer to mmc host structure
+ * @lock_needed: flag indicating if locking is needed
+ *
+ * This function starts the busy timer in case it was not already started.
+ */
+void mmc_cqe_clk_scaling_start_busy(struct mmc_queue *mq,
+ struct mmc_host *host, bool lock_needed)
+{
+ unsigned long flags;
+
+ if (!host->clk_scaling.enable)
+ return;
+
+ if (lock_needed)
+ spin_lock_irqsave(&host->clk_scaling.lock, flags);
+
+ if (!host->clk_scaling.is_busy_started &&
+ !(mq->cqe_busy & MMC_CQE_DCMD_BUSY)) {
+ host->clk_scaling.start_busy = ktime_get();
+ host->clk_scaling.is_busy_started = true;
+ }
+
+ if (lock_needed)
+ spin_unlock_irqrestore(&host->clk_scaling.lock, flags);
+}
+EXPORT_SYMBOL(mmc_cqe_clk_scaling_start_busy);
+
+/**
+ * mmc_cqe_clk_scaling_stop_busy() - stop busy timer for last data requests
+ * @host: pointer to mmc host structure
+ * @lock_needed: flag indicating if locking is needed
+ * @is_cqe_dcmd: true when the completed request is a CQE DCMD request
+ *
+ * This function stops the busy timer in case it is the last data request.
+ * In case the current request is not the last one, the busy time till
+ * now will be accumulated and the counter will be restarted.
+ */
+void mmc_cqe_clk_scaling_stop_busy(struct mmc_host *host,
+ bool lock_needed, bool is_cqe_dcmd)
+{
+ unsigned int cqe_active_reqs = 0;
+
+ if (!host->clk_scaling.enable)
+ return;
+
+ cqe_active_reqs = atomic_read(&host->active_reqs);
+
+ /*
+ * This gets invoked from the CQE completion path, which is hard IRQ
+ * context, so use spin_lock() instead of spin_lock_irqsave().
+ */
+ if (lock_needed)
+ spin_lock(&host->clk_scaling.lock);
+
+ /*
+ * For CQ mode: In completion of DCMD request, start busy time in
+ * case of pending data requests
+ */
+ if (is_cqe_dcmd) {
+ if (cqe_active_reqs && !host->clk_scaling.is_busy_started) {
+ host->clk_scaling.is_busy_started = true;
+ host->clk_scaling.start_busy = ktime_get();
+ }
+ goto out;
+ }
+
+ host->clk_scaling.total_busy_time_us +=
+ ktime_to_us(ktime_sub(ktime_get(),
+ host->clk_scaling.start_busy));
+
+ if (cqe_active_reqs) {
+ host->clk_scaling.is_busy_started = true;
+ host->clk_scaling.start_busy = ktime_get();
+ } else {
+ host->clk_scaling.is_busy_started = false;
+ }
+out:
+ if (lock_needed)
+ spin_unlock(&host->clk_scaling.lock);
+
+}
+EXPORT_SYMBOL(mmc_cqe_clk_scaling_stop_busy);
+
+/**
+ * mmc_can_scale_clk() - Check clock scaling capability
+ * @host: pointer to mmc host structure
+ */
+bool mmc_can_scale_clk(struct mmc_host *host)
+{
+ if (!host) {
+ pr_err("bad host parameter\n");
+ WARN_ON(1);
+ return false;
+ }
+
+ return host->caps2 & MMC_CAP2_CLK_SCALE;
+}
+EXPORT_SYMBOL(mmc_can_scale_clk);
+
+static int mmc_devfreq_get_dev_status(struct device *dev,
+ struct devfreq_dev_status *status)
+{
+ struct mmc_host *host = container_of(dev, struct mmc_host, class_dev);
+ struct mmc_devfeq_clk_scaling *clk_scaling;
+ unsigned long flags;
+
+ if (!host) {
+ pr_err("bad host parameter\n");
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ clk_scaling = &host->clk_scaling;
+
+ if (!clk_scaling->enable)
+ return 0;
+
+ spin_lock_irqsave(&host->clk_scaling.lock, flags);
+
+ /* accumulate the busy time of ongoing work */
+ memset(status, 0, sizeof(*status));
+ if (clk_scaling->is_busy_started) {
+ if (host->cqe_on) {
+ /* the "busy-timer" will be restarted in case there
+ * are pending data requests
+ */
+ mmc_cqe_clk_scaling_stop_busy(host, false, false);
+ } else {
+ mmc_clk_scaling_stop_busy(host, false);
+ mmc_clk_scaling_start_busy(host, false);
+ }
+ }
+
+ status->busy_time = clk_scaling->total_busy_time_us;
+ status->total_time = ktime_to_us(ktime_sub(ktime_get(),
+ clk_scaling->measure_interval_start));
+ clk_scaling->total_busy_time_us = 0;
+ status->current_frequency = clk_scaling->curr_freq;
+ clk_scaling->measure_interval_start = ktime_get();
+
+ pr_debug("%s: status: load = %lu%% - total_time=%lu busy_time = %lu, clk=%lu\n",
+ mmc_hostname(host),
+ (status->busy_time*100)/status->total_time,
+ status->total_time, status->busy_time,
+ status->current_frequency);
+
+ spin_unlock_irqrestore(&host->clk_scaling.lock, flags);
+
+ return 0;
+}
+
+static bool mmc_is_valid_state_for_clk_scaling(struct mmc_host *host)
+{
+ struct mmc_card *card = host->card;
+ u32 status;
+
+ /*
+ * If the current partition type is RPMB, clock switching may not
+ * work properly as sending tuning command (CMD21) is illegal in
+ * this mode.
+ */
+ if (!card || (mmc_card_mmc(card) &&
+ (card->part_curr == EXT_CSD_PART_CONFIG_ACC_RPMB)))
+ return false;
+
+ if (mmc_send_status(card, &status)) {
+ pr_err("%s: Get card status fail\n", mmc_hostname(card->host));
+ return false;
+ }
+
+ return R1_CURRENT_STATE(status) == R1_STATE_TRAN;
+}
+
+int mmc_clk_update_freq(struct mmc_host *host,
+ unsigned long freq, enum mmc_load state)
+{
+ int err = 0;
+
+ if (!host) {
+ pr_err("bad host parameter\n");
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ /* make sure the card supports the frequency we want */
+ if (unlikely(freq > host->card->clk_scaling_highest)) {
+ freq = host->card->clk_scaling_highest;
+ pr_warn("%s: %s: High freq was overridden to %lu\n",
+ mmc_hostname(host), __func__,
+ host->card->clk_scaling_highest);
+ }
+
+ if (unlikely(freq < host->card->clk_scaling_lowest)) {
+ freq = host->card->clk_scaling_lowest;
+ pr_warn("%s: %s: Low freq was overridden to %lu\n",
+ mmc_hostname(host), __func__,
+ host->card->clk_scaling_lowest);
+ }
+
+ if (freq == host->clk_scaling.curr_freq)
+ goto out;
+
+ if (host->cqe_on) {
+ err = host->cqe_ops->cqe_wait_for_idle(host);
+ if (err) {
+ pr_err("%s: %s: CQE went in recovery path\n",
+ mmc_hostname(host), __func__);
+ goto out;
+ }
+ host->cqe_ops->cqe_off(host);
+ }
+
+ if (!mmc_is_valid_state_for_clk_scaling(host)) {
+ pr_debug("%s: invalid state for clock scaling - skipping\n",
+ mmc_hostname(host));
+ goto out;
+ }
+
+ err = host->bus_ops->change_bus_speed(host, &freq);
+ if (!err)
+ host->clk_scaling.curr_freq = freq;
+ else
+ pr_err("%s: %s: failed (%d) at freq=%lu\n",
+ mmc_hostname(host), __func__, err, freq);
+
+ /*
+ * CQE will be re-enabled as part of the CQE issuing path,
+ * so there is no need to unhalt it explicitly.
+ */
+
+out:
+ return err;
+}
+EXPORT_SYMBOL(mmc_clk_update_freq);
+
+static int mmc_devfreq_set_target(struct device *dev,
+ unsigned long *freq, u32 devfreq_flags)
+{
+ struct mmc_host *host = container_of(dev, struct mmc_host, class_dev);
+ struct mmc_devfeq_clk_scaling *clk_scaling;
+ int err = 0;
+ int abort;
+ unsigned long pflags = current->flags;
+ unsigned long flags;
+
+ /* Ensure scaling would happen even in memory pressure conditions */
+ current->flags |= PF_MEMALLOC;
+
+ if (!(host && freq)) {
+ pr_err("%s: unexpected host/freq parameter\n", __func__);
+ err = -EINVAL;
+ goto out;
+ }
+
+ clk_scaling = &host->clk_scaling;
+
+ if (!clk_scaling->enable)
+ goto out;
+
+ pr_debug("%s: target freq = %lu (%s)\n", mmc_hostname(host),
+ *freq, current->comm);
+
+ spin_lock_irqsave(&clk_scaling->lock, flags);
+ if (clk_scaling->target_freq == *freq ||
+ clk_scaling->skip_clk_scale_freq_update) {
+ spin_unlock_irqrestore(&clk_scaling->lock, flags);
+ goto out;
+ }
+
+ clk_scaling->need_freq_change = true;
+ clk_scaling->target_freq = *freq;
+ clk_scaling->state = *freq < clk_scaling->curr_freq ?
+ MMC_LOAD_LOW : MMC_LOAD_HIGH;
+ spin_unlock_irqrestore(&clk_scaling->lock, flags);
+
+ if (!clk_scaling->is_suspended && host->ios.clock)
+ abort = __mmc_claim_host(host, NULL,
+ &clk_scaling->devfreq_abort);
+ else
+ goto out;
+
+ if (abort)
+ goto out;
+
+ /*
+ * In case we were able to claim host there is no need to
+ * defer the frequency change. It will be done now
+ */
+ clk_scaling->need_freq_change = false;
+
+ err = mmc_clk_update_freq(host, *freq, clk_scaling->state);
+ if (err && err != -EAGAIN)
+ pr_err("%s: clock scale to %lu failed with error %d\n",
+ mmc_hostname(host), *freq, err);
+ else
+ pr_debug("%s: clock change to %lu finished successfully (%s)\n",
+ mmc_hostname(host), *freq, current->comm);
+
+ mmc_release_host(host);
+out:
+ current_restore_flags(pflags, PF_MEMALLOC);
+ return err;
+}
+
+/**
+ * mmc_deferred_scaling() - scale clocks from data path (mmc thread context)
+ * @host: pointer to mmc host structure
+ *
+ * This function does clock scaling in case "need_freq_change" flag was set
+ * by the clock scaling logic.
+ */
+void mmc_deferred_scaling(struct mmc_host *host)
+{
+ unsigned long target_freq;
+ int err;
+ struct mmc_devfeq_clk_scaling clk_scaling;
+ unsigned long flags;
+
+ if (!host->clk_scaling.enable)
+ return;
+
+ spin_lock_irqsave(&host->clk_scaling.lock, flags);
+
+ if (!host->clk_scaling.need_freq_change) {
+ spin_unlock_irqrestore(&host->clk_scaling.lock, flags);
+ return;
+ }
+
+ atomic_inc(&host->clk_scaling.devfreq_abort);
+ target_freq = host->clk_scaling.target_freq;
+ /*
+ * Store the clock scaling state while the lock is acquired so that
+ * if devfreq context modifies clk_scaling, it will get reflected only
+ * in the next deferred scaling check.
+ */
+ clk_scaling = host->clk_scaling;
+ host->clk_scaling.need_freq_change = false;
+ spin_unlock_irqrestore(&host->clk_scaling.lock, flags);
+
+ pr_debug("%s: doing deferred frequency change (%lu) (%s)\n",
+ mmc_hostname(host),
+ target_freq, current->comm);
+
+ err = mmc_clk_update_freq(host, target_freq,
+ clk_scaling.state);
+ if (err && err != -EAGAIN)
+ pr_err("%s: failed on deferred scale clocks (%d)\n",
+ mmc_hostname(host), err);
+ else
+ pr_debug("%s: clocks were successfully scaled to %lu (%s)\n",
+ mmc_hostname(host),
+ target_freq, current->comm);
+ atomic_dec(&host->clk_scaling.devfreq_abort);
+}
+EXPORT_SYMBOL(mmc_deferred_scaling);
+
+static int mmc_devfreq_create_freq_table(struct mmc_host *host)
+{
+ int i;
+ struct mmc_devfeq_clk_scaling *clk_scaling = &host->clk_scaling;
+
+ pr_debug("%s: supported: lowest=%lu, highest=%lu\n",
+ mmc_hostname(host),
+ host->card->clk_scaling_lowest,
+ host->card->clk_scaling_highest);
+
+ /*
+ * Create the frequency table and initialize it with default values.
+ * Initialize it with platform specific frequencies if the frequency
+ * table supplied by platform driver is present, otherwise initialize
+ * it with min and max frequencies supported by the card.
+ */
+ if (!clk_scaling->freq_table) {
+ if (clk_scaling->pltfm_freq_table_sz)
+ clk_scaling->freq_table_sz =
+ clk_scaling->pltfm_freq_table_sz;
+ else
+ clk_scaling->freq_table_sz = 2;
+
+ clk_scaling->freq_table = kcalloc(
+ clk_scaling->freq_table_sz,
+ sizeof(*(clk_scaling->freq_table)), GFP_KERNEL);
+ if (!clk_scaling->freq_table)
+ return -ENOMEM;
+
+ if (clk_scaling->pltfm_freq_table) {
+ memcpy(clk_scaling->freq_table,
+ clk_scaling->pltfm_freq_table,
+ (clk_scaling->pltfm_freq_table_sz *
+ sizeof(*(clk_scaling->pltfm_freq_table))));
+ } else {
+ pr_debug("%s: no frequency table defined - setting default\n",
+ mmc_hostname(host));
+ clk_scaling->freq_table[0] =
+ host->card->clk_scaling_lowest;
+ clk_scaling->freq_table[1] =
+ host->card->clk_scaling_highest;
+ goto out;
+ }
+ }
+
+ if (host->card->clk_scaling_lowest >
+ clk_scaling->freq_table[0])
+ pr_debug("%s: frequency table undershot possible freq\n",
+ mmc_hostname(host));
+
+ for (i = 0; i < clk_scaling->freq_table_sz; i++) {
+ if (clk_scaling->freq_table[i] <=
+ host->card->clk_scaling_highest)
+ continue;
+ clk_scaling->freq_table[i] =
+ host->card->clk_scaling_highest;
+ clk_scaling->freq_table_sz = i + 1;
+ pr_debug("%s: frequency table overshot possible freq (%d)\n",
+ mmc_hostname(host), clk_scaling->freq_table[i]);
+ break;
+ }
+
+out:
+ /*
+ * devfreq requires an unsigned long freq_table while the freq_table
+ * in clk_scaling is u32, so allocate a separate array for it here
+ * and free it when exiting clock scaling.
+ */
+ clk_scaling->devfreq_profile.freq_table = kcalloc(
+ clk_scaling->freq_table_sz,
+ sizeof(*(clk_scaling->devfreq_profile.freq_table)),
+ GFP_KERNEL);
+ if (!clk_scaling->devfreq_profile.freq_table) {
+ kfree(clk_scaling->freq_table);
+ return -ENOMEM;
+ }
+ clk_scaling->devfreq_profile.max_state = clk_scaling->freq_table_sz;
+
+ for (i = 0; i < clk_scaling->freq_table_sz; i++) {
+ clk_scaling->devfreq_profile.freq_table[i] =
+ clk_scaling->freq_table[i];
+ pr_debug("%s: freq[%d] = %u\n",
+ mmc_hostname(host), i, clk_scaling->freq_table[i]);
+ }
+
+ return 0;
+}
+
+/**
+ * mmc_init_clk_scaling() - Initialize clock scaling
+ * @host: pointer to mmc host structure
+ *
+ * Initialize clock scaling for supported hosts. The caller is assumed to
+ * ensure the clock is running at the maximum possible frequency before
+ * calling this function. Uses struct devfreq_simple_ondemand_data to
+ * configure the governor.
+ */
+int mmc_init_clk_scaling(struct mmc_host *host)
+{
+ int err;
+ struct devfreq *devfreq;
+
+ if (!host || !host->card) {
+ pr_err("%s: unexpected host/card parameters\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ if (!mmc_can_scale_clk(host) ||
+ !host->bus_ops->change_bus_speed) {
+ pr_debug("%s: clock scaling is not supported\n",
+ mmc_hostname(host));
+ return 0;
+ }
+
+ pr_debug("registering %s dev (%pK) to devfreq\n",
+ mmc_hostname(host),
+ mmc_classdev(host));
+
+ if (host->clk_scaling.devfreq) {
+ pr_err("%s: dev is already registered for dev %pK\n",
+ mmc_hostname(host),
+ mmc_dev(host));
+ return -EPERM;
+ }
+ spin_lock_init(&host->clk_scaling.lock);
+ atomic_set(&host->clk_scaling.devfreq_abort, 0);
+ host->clk_scaling.curr_freq = host->ios.clock;
+ host->clk_scaling.need_freq_change = false;
+ host->clk_scaling.is_busy_started = false;
+
+ host->clk_scaling.devfreq_profile.polling_ms =
+ host->clk_scaling.polling_delay_ms;
+ host->clk_scaling.devfreq_profile.get_dev_status =
+ mmc_devfreq_get_dev_status;
+ host->clk_scaling.devfreq_profile.target = mmc_devfreq_set_target;
+ host->clk_scaling.devfreq_profile.initial_freq = host->ios.clock;
+
+ host->clk_scaling.ondemand_gov_data.upthreshold =
+ host->clk_scaling.upthreshold;
+ host->clk_scaling.ondemand_gov_data.downdifferential =
+ host->clk_scaling.upthreshold - host->clk_scaling.downthreshold;
+
+ err = mmc_devfreq_create_freq_table(host);
+ if (err) {
+ pr_err("%s: fail to create devfreq frequency table\n",
+ mmc_hostname(host));
+ return err;
+ }
+
+ dev_pm_opp_add(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[0], 0);
+ dev_pm_opp_add(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[1], 0);
+
+ pr_debug("%s: adding devfreq with: upthreshold=%u downthreshold=%u polling=%u\n",
+ mmc_hostname(host),
+ host->clk_scaling.ondemand_gov_data.upthreshold,
+ host->clk_scaling.ondemand_gov_data.downdifferential,
+ host->clk_scaling.devfreq_profile.polling_ms);
+
+ devfreq = devfreq_add_device(
+ mmc_classdev(host),
+ &host->clk_scaling.devfreq_profile,
+ "simple_ondemand",
+ &host->clk_scaling.ondemand_gov_data);
+
+ if (IS_ERR(devfreq)) {
+ pr_err("%s: unable to register with devfreq\n",
+ mmc_hostname(host));
+ dev_pm_opp_remove(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[0]);
+ dev_pm_opp_remove(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[1]);
+ return PTR_ERR(devfreq);
+ }
+
+ host->clk_scaling.devfreq = devfreq;
+ pr_debug("%s: clk scaling is enabled for device %s (%pK) with devfreq %pK (clock = %uHz)\n",
+ mmc_hostname(host),
+ dev_name(mmc_classdev(host)),
+ mmc_classdev(host),
+ host->clk_scaling.devfreq,
+ host->ios.clock);
+
+ host->clk_scaling.enable = true;
+
+ return err;
+}
+EXPORT_SYMBOL(mmc_init_clk_scaling);
+
+/**
+ * mmc_suspend_clk_scaling() - suspend clock scaling
+ * @host: pointer to mmc host structure
+ *
+ * This API will suspend devfreq feature for the specific host.
+ * The statistics collected by mmc will be cleared.
+ * This function is intended to be called by the pm callbacks
+ * (e.g. runtime_suspend, suspend) of the mmc device
+ */
+int mmc_suspend_clk_scaling(struct mmc_host *host)
+{
+ int err;
+
+ if (!host) {
+ WARN(1, "bad host parameter\n");
+ return -EINVAL;
+ }
+
+ if (!mmc_can_scale_clk(host) || !host->clk_scaling.enable ||
+ host->clk_scaling.is_suspended)
+ return 0;
+
+ if (!host->clk_scaling.devfreq) {
+ pr_err("%s: %s: no devfreq is assosiated with this device\n",
+ mmc_hostname(host), __func__);
+ return -EPERM;
+ }
+
+ atomic_inc(&host->clk_scaling.devfreq_abort);
+ wake_up(&host->wq);
+ err = devfreq_suspend_device(host->clk_scaling.devfreq);
+ if (err) {
+ pr_err("%s: %s: failed to suspend devfreq\n",
+ mmc_hostname(host), __func__);
+ return err;
+ }
+ host->clk_scaling.is_suspended = true;
+
+ host->clk_scaling.total_busy_time_us = 0;
+
+ pr_debug("%s: devfreq was removed\n", mmc_hostname(host));
+
+ return 0;
+}
+EXPORT_SYMBOL(mmc_suspend_clk_scaling);
+
+/**
+ * mmc_resume_clk_scaling() - resume clock scaling
+ * @host: pointer to mmc host structure
+ *
+ * This API will resume devfreq feature for the specific host.
+ * This API is intended to be called by the pm callbacks
+ * (e.g. runtime_resume, resume) of the mmc device
+ */
+int mmc_resume_clk_scaling(struct mmc_host *host)
+{
+ int err = 0;
+ u32 max_clk_idx = 0;
+ u32 devfreq_max_clk = 0;
+ u32 devfreq_min_clk = 0;
+
+ if (!host) {
+ WARN(1, "bad host parameter\n");
+ return -EINVAL;
+ }
+
+ if (!mmc_can_scale_clk(host))
+ return 0;
+
+ /*
+ * If clock scaling is already exited when resume is called, like
+ * during mmc shutdown, it is not an error and should not fail the
+ * API calling this.
+ */
+ if (!host->clk_scaling.devfreq) {
+ pr_warn("%s: %s: no devfreq is assosiated with this device\n",
+ mmc_hostname(host), __func__);
+ return 0;
+ }
+
+ atomic_set(&host->clk_scaling.devfreq_abort, 0);
+
+ max_clk_idx = host->clk_scaling.freq_table_sz - 1;
+ devfreq_max_clk = host->clk_scaling.freq_table[max_clk_idx];
+ devfreq_min_clk = host->clk_scaling.freq_table[0];
+
+ host->clk_scaling.curr_freq = devfreq_max_clk;
+ if (host->ios.clock < host->clk_scaling.freq_table[max_clk_idx])
+ host->clk_scaling.curr_freq = devfreq_min_clk;
+ host->clk_scaling.target_freq = host->clk_scaling.curr_freq;
+
+ err = devfreq_resume_device(host->clk_scaling.devfreq);
+ if (err) {
+ pr_err("%s: %s: failed to resume devfreq (%d)\n",
+ mmc_hostname(host), __func__, err);
+ } else {
+ host->clk_scaling.is_suspended = false;
+ pr_debug("%s: devfreq resumed\n", mmc_hostname(host));
+ }
+
+ return err;
+}
+EXPORT_SYMBOL(mmc_resume_clk_scaling);
+
+/**
+ * mmc_exit_clk_scaling() - Disable clock scaling
+ * @host: pointer to mmc host structure
+ *
+ * Disable clock scaling permanently.
+ */
+int mmc_exit_clk_scaling(struct mmc_host *host)
+{
+ int err;
+
+ if (!host) {
+ pr_err("%s: bad host parameter\n", __func__);
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ if (!mmc_can_scale_clk(host))
+ return 0;
+
+ if (!host->clk_scaling.devfreq) {
+ pr_err("%s: %s: no devfreq is assosiated with this device\n",
+ mmc_hostname(host), __func__);
+ return -EPERM;
+ }
+
+ err = mmc_suspend_clk_scaling(host);
+ if (err) {
+ pr_err("%s: %s: fail to suspend clock scaling (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ err = devfreq_remove_device(host->clk_scaling.devfreq);
+ if (err) {
+ pr_err("%s: remove devfreq failed (%d)\n",
+ mmc_hostname(host), err);
+ return err;
+ }
+
+ dev_pm_opp_remove(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[0]);
+ dev_pm_opp_remove(mmc_classdev(host),
+ host->clk_scaling.devfreq_profile.freq_table[1]);
+
+ kfree(host->clk_scaling.devfreq_profile.freq_table);
+
+ host->clk_scaling.devfreq = NULL;
+ atomic_set(&host->clk_scaling.devfreq_abort, 1);
+
+ kfree(host->clk_scaling.freq_table);
+ host->clk_scaling.freq_table = NULL;
+
+ pr_debug("%s: devfreq was removed\n", mmc_hostname(host));
+
+ return 0;
+}
+EXPORT_SYMBOL(mmc_exit_clk_scaling);
+
+
static inline void mmc_complete_cmd(struct mmc_request *mrq)
{
if (mrq->cap_cmd_during_tfr && !completion_done(&mrq->cmd_completion))
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 328c78d..2abf75a 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -14,6 +14,7 @@
struct mmc_host;
struct mmc_card;
struct mmc_request;
+struct mmc_queue;

#define MMC_CMD_RETRIES 3

@@ -29,6 +30,7 @@ struct mmc_bus_ops {
int (*shutdown)(struct mmc_host *);
int (*hw_reset)(struct mmc_host *);
int (*sw_reset)(struct mmc_host *);
+ int (*change_bus_speed)(struct mmc_host *host, unsigned long *freq);
};

void mmc_attach_bus(struct mmc_host *host, const struct mmc_bus_ops *ops);
@@ -57,6 +59,8 @@ int mmc_select_drive_strength(struct mmc_card *card, unsigned int max_dtr,
void mmc_power_cycle(struct mmc_host *host, u32 ocr);
void mmc_set_initial_state(struct mmc_host *host);
u32 mmc_vddrange_to_ocrmask(int vdd_min, int vdd_max);
+int mmc_clk_update_freq(struct mmc_host *host,
+ unsigned long freq, enum mmc_load state);

static inline void mmc_delay(unsigned int ms)
{
@@ -87,6 +91,19 @@ static inline void mmc_delay(unsigned int ms)
void mmc_add_card_debugfs(struct mmc_card *card);
void mmc_remove_card_debugfs(struct mmc_card *card);

+extern bool mmc_can_scale_clk(struct mmc_host *host);
+extern int mmc_init_clk_scaling(struct mmc_host *host);
+extern int mmc_suspend_clk_scaling(struct mmc_host *host);
+extern int mmc_resume_clk_scaling(struct mmc_host *host);
+extern int mmc_exit_clk_scaling(struct mmc_host *host);
+extern void mmc_deferred_scaling(struct mmc_host *host);
+extern void mmc_cqe_clk_scaling_start_busy(struct mmc_queue *mq,
+ struct mmc_host *host, bool lock_needed);
+extern void mmc_cqe_clk_scaling_stop_busy(struct mmc_host *host,
+ bool lock_needed, bool is_cqe_dcmd);
+
+extern unsigned long mmc_get_max_frequency(struct mmc_host *host);
+
int mmc_execute_tuning(struct mmc_card *card);
int mmc_hs200_to_hs400(struct mmc_card *card);
int mmc_hs400_to_hs200(struct mmc_card *card);
diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index c880489..c9fccfc 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1539,6 +1539,198 @@ static int mmc_hs200_tuning(struct mmc_card *card)
return mmc_execute_tuning(card);
}

+static int mmc_select_hs_ddr52(struct mmc_host *host)
+{
+ int err;
+
+ mmc_select_hs(host->card);
+ err = mmc_select_bus_width(host->card);
+ if (err < 0) {
+ pr_err("%s: %s: select_bus_width failed(%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ err = mmc_select_hs_ddr(host->card);
+ mmc_set_clock(host, MMC_HIGH_52_MAX_DTR);
+
+ return err;
+}
+
+/*
+ * Scale down from HS400 to HS in order to allow frequency change.
+ * This is needed for cards that don't support changing the frequency in HS400.
+ */
+static int mmc_scale_low(struct mmc_host *host, unsigned long freq)
+{
+ int err = 0;
+
+ mmc_set_timing(host, MMC_TIMING_LEGACY);
+ mmc_set_clock(host, MMC_HIGH_26_MAX_DTR);
+
+ if (host->clk_scaling.lower_bus_speed_mode &
+ MMC_SCALING_LOWER_DDR52_MODE) {
+ err = mmc_select_hs_ddr52(host);
+ if (err)
+ pr_err("%s: %s: failed to switch to DDR52: err: %d\n",
+ mmc_hostname(host), __func__, err);
+ else
+ return err;
+ }
+
+ err = mmc_select_hs(host->card);
+ if (err) {
+ pr_err("%s: %s: scaling low: failed (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ err = mmc_select_bus_width(host->card);
+ if (err < 0) {
+ pr_err("%s: %s: select_bus_width failed(%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ mmc_set_clock(host, freq);
+
+ return 0;
+}
+
+/*
+ * Scale up from HS to HS200/HS400
+ */
+static int mmc_scale_high(struct mmc_host *host)
+{
+ int err = 0;
+
+ if (mmc_card_ddr52(host->card)) {
+ mmc_set_timing(host, MMC_TIMING_LEGACY);
+ mmc_set_clock(host, MMC_HIGH_26_MAX_DTR);
+ }
+
+ if (!host->card->ext_csd.strobe_support) {
+ if (!(host->card->mmc_avail_type & EXT_CSD_CARD_TYPE_HS200)) {
+ pr_err("%s: %s: card does not support HS200\n",
+ mmc_hostname(host), __func__);
+ WARN_ON(1);
+ return -EPERM;
+ }
+
+ err = mmc_select_hs200(host->card);
+ if (err) {
+ pr_err("%s: %s: selecting HS200 failed (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ mmc_set_bus_speed(host->card);
+
+ err = mmc_hs200_tuning(host->card);
+ if (err) {
+ pr_err("%s: %s: hs200 tuning failed (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ if (!(host->card->mmc_avail_type & EXT_CSD_CARD_TYPE_HS400)) {
+ pr_debug("%s: card does not support HS400\n",
+ mmc_hostname(host));
+ return 0;
+ }
+ }
+
+ err = mmc_select_hs400(host->card);
+ if (err) {
+ pr_err("%s: %s: select hs400 failed (%d)\n",
+ mmc_hostname(host), __func__, err);
+ return err;
+ }
+
+ return 0;
+}
+
+static int mmc_set_clock_bus_speed(struct mmc_card *card, unsigned long freq)
+{
+ int err = 0;
+
+ if (freq == MMC_HS200_MAX_DTR)
+ err = mmc_scale_high(card->host);
+ else
+ err = mmc_scale_low(card->host, freq);
+
+ return err;
+}
+
+static inline unsigned long mmc_ddr_freq_accommodation(unsigned long freq)
+{
+ if (freq == MMC_HIGH_DDR_MAX_DTR)
+ return freq;
+
+ return freq/2;
+}
+
+/**
+ * mmc_change_bus_speed() - Change MMC card bus frequency at runtime
+ * @host: pointer to mmc host structure
+ * @freq: pointer to desired frequency to be set
+ *
+ * Change the MMC card bus frequency at runtime after the card is
+ * initialized. Callers are expected to make sure of the card's
+ * state (DATA/RCV/TRANSFER) before changing the frequency at runtime.
+ *
+ * If the frequency to change is greater than max. supported by card,
+ * *freq is changed to max. supported by card. If it is less than min.
+ * supported by host, *freq is changed to min. supported by host.
+ * The host is assumed to be claimed while calling this function.
+ */
+static int mmc_change_bus_speed(struct mmc_host *host, unsigned long *freq)
+{
+ int err = 0;
+ struct mmc_card *card;
+ unsigned long actual_freq;
+
+ card = host->card;
+
+ if (!card || !freq) {
+ err = -EINVAL;
+ goto out;
+ }
+ actual_freq = *freq;
+
+ WARN_ON(!host->claimed);
+
+ /*
+ * For scaling up/down HS400 we'll need special handling,
+ * for other timings we can simply do clock frequency change
+ */
+ if (mmc_card_hs400(card) ||
+ (!mmc_card_hs200(host->card) && *freq == MMC_HS200_MAX_DTR)) {
+ err = mmc_set_clock_bus_speed(card, *freq);
+ if (err) {
+ pr_err("%s: %s: failed (%d)to set bus and clock speed (freq=%lu)\n",
+ mmc_hostname(host), __func__, err, *freq);
+ goto out;
+ }
+ } else if (mmc_card_hs200(host->card)) {
+ mmc_set_clock(host, *freq);
+ err = mmc_hs200_tuning(host->card);
+ if (err) {
+ pr_warn("%s: %s: tuning execution failed %d\n",
+ mmc_hostname(card->host),
+ __func__, err);
+ mmc_set_clock(host, host->clk_scaling.curr_freq);
+ }
+ } else {
+ if (mmc_card_ddr52(host->card))
+ actual_freq = mmc_ddr_freq_accommodation(*freq);
+ mmc_set_clock(host, actual_freq);
+ }
+
+out:
+ return err;
+}
+
/*
* Handle the detection and initialisation of a card.
*
diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index fe914ff..fc78c09 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -592,6 +592,60 @@ static int sd_set_current_limit(struct mmc_card *card, u8 *status)
return 0;
}

+/**
+ * mmc_sd_change_bus_speed() - Change SD card bus frequency at runtime
+ * @host: pointer to mmc host structure
+ * @freq: pointer to desired frequency to be set
+ *
+ * Change the SD card bus frequency at runtime after the card is
+ * initialized. Callers are expected to make sure of the card's
+ * state (DATA/RCV/TRANSFER) before changing the frequency at runtime.
+ *
+ * If the frequency to change is greater than max. supported by card,
+ * *freq is changed to max. supported by card and if it is less than min.
+ * supported by host, *freq is changed to min. supported by host.
+ */
+static int mmc_sd_change_bus_speed(struct mmc_host *host, unsigned long *freq)
+{
+ int err = 0;
+ struct mmc_card *card;
+
+ /*
+ * Assign card pointer after claiming host to avoid race
+ * conditions that may arise during removal of the card.
+ */
+ card = host->card;
+
+ /* sanity checks */
+ if (!card || !freq) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ mmc_set_clock(host, (unsigned int) (*freq));
+
+ if (!mmc_host_is_spi(card->host) && mmc_card_uhs(card)
+ && card->host->ops->execute_tuning) {
+ /*
+ * We try to probe host driver for tuning for any
+ * frequency, it is host driver responsibility to
+ * perform actual tuning only when required.
+ */
+ err = card->host->ops->execute_tuning(card->host,
+ MMC_SEND_TUNING_BLOCK);
+ if (err) {
+ pr_warn("%s: %s: tuning execution failed %d. Restoring to previous clock %lu\n",
+ mmc_hostname(card->host), __func__, err,
+ host->clk_scaling.curr_freq);
+ mmc_set_clock(host, host->clk_scaling.curr_freq);
+ }
+ }
+
+out:
+ return err;
+}
+
+
/*
* UHS-I specific initialization procedure
*/
--
1.9.1

2019-10-21 14:38:20

by Ram Prakash Gupta

Subject: [RFC 6/6] dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters

Add clock scaling dt parameters.

Signed-off-by: Ram Prakash Gupta <[email protected]>
---
Documentation/devicetree/bindings/mmc/sdhci-msm.txt | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/Documentation/devicetree/bindings/mmc/sdhci-msm.txt b/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
index da4edb1..afaf88d 100644
--- a/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
+++ b/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
@@ -39,6 +39,21 @@ Required properties:
"cal" - reference clock for RCLK delay calibration (optional)
"sleep" - sleep clock for RCLK delay calibration (optional)

+Optional properties:
+- devfreq,freq-table - specifies supported frequencies for clock scaling.
+ Clock scaling logic shall toggle between these frequencies based
+ on card load. In case the defined frequencies are above or below
+ the frequencies supported by the card, they will be overridden
+ during card init. In case this entry is not supplied,
+ the driver will construct one based on the max and min
+ frequencies supported by the card.
+ The frequencies must be ordered from lowest to highest.
+
+- scaling-lower-bus-speed-mode - A few hosts can support DDR52 mode at the
+ same lower system voltage corner as high-speed mode. In such
+ cases, it is better to use DDR52 mode, which improves
+ performance without any power impact. Set this property to
+ "DDR52" to select DDR52 as the lower bus speed mode.
+
Example:

sdhc_1: sdhci@f9824900 {
@@ -56,6 +71,10 @@ Example:

clocks = <&gcc GCC_SDCC1_APPS_CLK>, <&gcc GCC_SDCC1_AHB_CLK>;
clock-names = "core", "iface";
+
+ devfreq,freq-table = <50000000 200000000>;
+ scaling-lower-bus-speed-mode = "DDR52"
+
};

sdhc_2: sdhci@f98a4900 {
--
1.9.1

2019-10-21 14:38:35

by Ram Prakash Gupta

Subject: [RFC 5/6] mmc: sdhci-msm: Add capability in platform host

Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Talel Shenhar <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
Signed-off-by: Veerabhadrarao Badiganti <[email protected]>
Signed-off-by: Bao D. Nguyen <[email protected]>
Signed-off-by: Can Guo <[email protected]>
Signed-off-by: Sayali Lokhande <[email protected]>
Co-Developed-by: Ram Prakash Gupta <[email protected]>
Signed-off-by: Ram Prakash Gupta <[email protected]>
---
drivers/mmc/host/sdhci-msm.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/mmc/host/sdhci-msm.c b/drivers/mmc/host/sdhci-msm.c
index b75c82d..67accf6 100644
--- a/drivers/mmc/host/sdhci-msm.c
+++ b/drivers/mmc/host/sdhci-msm.c
@@ -1934,6 +1934,8 @@ static int sdhci_msm_probe(struct platform_device *pdev)
goto clk_disable;
}

+ msm_host->mmc->caps2 |= MMC_CAP2_CLK_SCALE;
+
pm_runtime_get_noresume(&pdev->dev);
pm_runtime_set_active(&pdev->dev);
pm_runtime_enable(&pdev->dev);
--
1.9.1

2019-10-21 14:38:48

by Ram Prakash Gupta

Subject: [RFC 4/6] mmc: core: Add debugfs entries for scaling support

Add debugfs entries for scaling support.

Signed-off-by: Sujit Reddy Thumma <[email protected]>
Signed-off-by: Talel Shenhar <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
Signed-off-by: Asutosh Das <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
Signed-off-by: Veerabhadrarao Badiganti <[email protected]>
Signed-off-by: Bao D. Nguyen <[email protected]>
Signed-off-by: Can Guo <[email protected]>
Signed-off-by: Sayali Lokhande <[email protected]>
Co-Developed-by: Ram Prakash Gupta <[email protected]>
Signed-off-by: Ram Prakash Gupta <[email protected]>
---
drivers/mmc/core/debugfs.c | 90 ++++++++++++++++++++++++++
drivers/mmc/core/host.c | 153 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 243 insertions(+)

diff --git a/drivers/mmc/core/debugfs.c b/drivers/mmc/core/debugfs.c
index 09e0c76..a9f1e3f 100644
--- a/drivers/mmc/core/debugfs.c
+++ b/drivers/mmc/core/debugfs.c
@@ -222,6 +222,81 @@ static int mmc_clock_opt_set(void *data, u64 val)
DEFINE_SIMPLE_ATTRIBUTE(mmc_clock_fops, mmc_clock_opt_get, mmc_clock_opt_set,
"%llu\n");

+static int mmc_scale_get(void *data, u64 *val)
+{
+ struct mmc_host *host = data;
+
+ *val = host->clk_scaling.curr_freq;
+
+ return 0;
+}
+
+static int mmc_scale_set(void *data, u64 val)
+{
+ int err = 0;
+ struct mmc_host *host = data;
+
+ mmc_claim_host(host);
+
+ /* change frequency from sysfs manually */
+ err = mmc_clk_update_freq(host, val, host->clk_scaling.state);
+ if (err == -EAGAIN)
+ err = 0;
+ else if (err)
+ pr_err("%s: clock scale to %llu failed with error %d\n",
+ mmc_hostname(host), val, err);
+ else
+ pr_debug("%s: clock change to %llu finished successfully (%s)\n",
+ mmc_hostname(host), val, current->comm);
+
+ mmc_release_host(host);
+
+ return err;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(mmc_scale_fops, mmc_scale_get, mmc_scale_set,
+ "%llu\n");
+
+static int mmc_max_clock_get(void *data, u64 *val)
+{
+ struct mmc_host *host = data;
+
+ if (!host)
+ return -EINVAL;
+
+ *val = host->f_max;
+
+ return 0;
+}
+
+static int mmc_max_clock_set(void *data, u64 val)
+{
+ struct mmc_host *host = data;
+ int err = -EINVAL;
+ unsigned long freq = val;
+ unsigned int old_freq;
+
+ if (!host || (val < host->f_min))
+ goto out;
+
+ mmc_claim_host(host);
+ if (host->bus_ops && host->bus_ops->change_bus_speed) {
+ old_freq = host->f_max;
+ host->f_max = freq;
+
+ err = host->bus_ops->change_bus_speed(host, &freq);
+
+ if (err)
+ host->f_max = old_freq;
+ }
+ mmc_release_host(host);
+out:
+ return err;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(mmc_max_clock_fops, mmc_max_clock_get,
+ mmc_max_clock_set, "%llu\n");
+
void mmc_add_host_debugfs(struct mmc_host *host)
{
struct dentry *root;
@@ -235,6 +310,19 @@ void mmc_add_host_debugfs(struct mmc_host *host)
debugfs_create_file("clock", S_IRUSR | S_IWUSR, root, host,
&mmc_clock_fops);

+ if (!debugfs_create_file("max_clock", 0600, root, host,
+ &mmc_max_clock_fops))
+ goto err_node;
+
+ if (!debugfs_create_file("scale", 0600, root, host,
+ &mmc_scale_fops))
+ goto err_node;
+
+ if (!debugfs_create_bool("skip_clk_scale_freq_update",
+ 0600, root,
+ &host->clk_scaling.skip_clk_scale_freq_update))
+ goto err_node;
+
#ifdef CONFIG_FAIL_MMC_REQUEST
if (fail_request)
setup_fault_attr(&fail_default_attr, fail_request);
@@ -242,6 +330,8 @@ void mmc_add_host_debugfs(struct mmc_host *host)
fault_create_debugfs_attr("fail_mmc_request", root,
&host->fail_mmc_request);
#endif
+err_node:
+ return;
}

void mmc_remove_host_debugfs(struct mmc_host *host)
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 9138041..c5b904b 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -522,6 +522,154 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)

EXPORT_SYMBOL(mmc_alloc_host);

+static ssize_t enable_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ if (!host)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", mmc_can_scale_clk(host));
+}
+
+static ssize_t enable_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+ unsigned long value;
+
+ if (!host || !host->card || kstrtoul(buf, 0, &value))
+ return -EINVAL;
+
+ mmc_get_card(host->card, NULL);
+
+ if (!value) {
+ /* Suspend the clock scaling and mask host capability */
+ if (host->clk_scaling.enable)
+ mmc_suspend_clk_scaling(host);
+ host->clk_scaling.enable = false;
+ host->caps2 &= ~MMC_CAP2_CLK_SCALE;
+ host->clk_scaling.state = MMC_LOAD_HIGH;
+ /* Set to max. frequency when disabling */
+ mmc_clk_update_freq(host, host->card->clk_scaling_highest,
+ host->clk_scaling.state);
+ } else if (value) {
+ /* Unmask host capability and resume scaling */
+ host->caps2 |= MMC_CAP2_CLK_SCALE;
+ if (!host->clk_scaling.enable) {
+ host->clk_scaling.enable = true;
+ mmc_resume_clk_scaling(host);
+ }
+ }
+
+ mmc_put_card(host->card, NULL);
+
+ return count;
+}
+
+static ssize_t up_threshold_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ if (!host)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", host->clk_scaling.upthreshold);
+}
+
+#define MAX_PERCENTAGE 100
+static ssize_t up_threshold_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+ unsigned long value;
+
+ if (!host || kstrtoul(buf, 0, &value) || (value > MAX_PERCENTAGE))
+ return -EINVAL;
+
+ host->clk_scaling.upthreshold = value;
+
+ pr_debug("%s: clkscale_up_thresh set to %lu\n",
+ mmc_hostname(host), value);
+ return count;
+}
+
+static ssize_t down_threshold_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ if (!host)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%d\n",
+ host->clk_scaling.downthreshold);
+}
+
+static ssize_t down_threshold_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+ unsigned long value;
+
+ if (!host || kstrtoul(buf, 0, &value) || (value > MAX_PERCENTAGE))
+ return -EINVAL;
+
+ host->clk_scaling.downthreshold = value;
+
+ pr_debug("%s: clkscale_down_thresh set to %lu\n",
+ mmc_hostname(host), value);
+ return count;
+}
+
+static ssize_t polling_interval_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ if (!host)
+ return -EINVAL;
+
+ return snprintf(buf, PAGE_SIZE, "%lu milliseconds\n",
+ host->clk_scaling.polling_delay_ms);
+}
+
+static ssize_t polling_interval_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+ unsigned long value;
+
+ if (!host || kstrtoul(buf, 0, &value))
+ return -EINVAL;
+
+ host->clk_scaling.polling_delay_ms = value;
+
+ pr_debug("%s: clkscale_polling_delay_ms set to %lu\n",
+ mmc_hostname(host), value);
+ return count;
+}
+
+DEVICE_ATTR_RW(enable);
+DEVICE_ATTR_RW(polling_interval);
+DEVICE_ATTR_RW(up_threshold);
+DEVICE_ATTR_RW(down_threshold);
+
+static struct attribute *clk_scaling_attrs[] = {
+ &dev_attr_enable.attr,
+ &dev_attr_up_threshold.attr,
+ &dev_attr_down_threshold.attr,
+ &dev_attr_polling_interval.attr,
+ NULL,
+};
+
+static struct attribute_group clk_scaling_attr_grp = {
+ .name = "clk_scaling",
+ .attrs = clk_scaling_attrs,
+};
+
/**
* mmc_add_host - initialise host hardware
* @host: mmc host
@@ -551,6 +699,10 @@ int mmc_add_host(struct mmc_host *host)
#ifdef CONFIG_DEBUG_FS
mmc_add_host_debugfs(host);
#endif
+ err = sysfs_create_group(&host->class_dev.kobj, &clk_scaling_attr_grp);
+ if (err)
+ pr_err("%s: failed to create clk scale sysfs group with err %d\n",
+ __func__, err);

mmc_start_host(host);
mmc_register_pm_notifier(host);
@@ -577,6 +729,7 @@ void mmc_remove_host(struct mmc_host *host)
mmc_remove_host_debugfs(host);
#endif

+ sysfs_remove_group(&host->class_dev.kobj, &clk_scaling_attr_grp);
device_del(&host->class_dev);

led_trigger_unregister_simple(host->led);
--
1.9.1

2019-10-22 08:42:38

by Ulf Hansson

[permalink] [raw]
Subject: Re: [RFC 0/6] mmc: Add clock scaling support for mmc driver

On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta <[email protected]> wrote:
>
> This change adds the use of devfreq based clock scaling to MMC.
> This applicable for eMMC and SDCard.
> For some workloads, such as video playback, it isn't necessary
> for these cards to run at high speed. Running at lower
> frequency, in such cases can still meet the deadlines for data
> transfers.
>
> Scaling down the clock frequency dynamically has power savings
> not only because the bus is running at lower frequency but also
> has an advantage of scaling down the system core voltage, if
> supported. Provide an ondemand clock scaling support similar
> to the cpufreq ondemand governor having two thresholds,
> up_threshold and down_threshold to decide whether to increase
> the frequency or scale it down respectively as per load.

This sounds simple, but what the series is doing is far more
complicated than just scaling the bus clock, as it also re-negotiates the
bus speed mode.

Each time the triggering point for scaling up/down is hit, then a
series of commands needs to be sent to the card, including running the
tuning procedure. The point is, for sure, this doesn't come for free,
both from a latency point of view, but also from an energy cost point
of view. So, whether this really improves the behaviour, seems like
very use case sensitive, right!?

Overall, when it comes to use cases, we have very limited knowledge
about them at the mmc block layer level. I think it should remain like
that. If at any place at all, this information is better maintained by
the generic block layer and potentially the configured I/O scheduler.

This brings me to a question about the tests you have run. Can you
share some information and data about that?

>
>
> Ram Prakash Gupta (6):
> mmc: core: Parse clk scaling dt entries
> mmc: core: Add core scaling support in driver
> mmc: core: Initialize clk scaling for mmc and SDCard
> mmc: core: Add debugfs entries for scaling support
> mmc: sdhci-msm: Add capability in platfrom host
> dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>
> .../devicetree/bindings/mmc/sdhci-msm.txt | 19 +

I noticed that the DT patch was put last in the series, but the
parsing is implemented in the first patch. Please flip this around. If
you want to implement DT parsing of new bindings, please make sure to
discuss the new bindings first.

> drivers/mmc/core/block.c | 19 +-
> drivers/mmc/core/core.c | 777 +++++++++++++++++++++
> drivers/mmc/core/core.h | 17 +
> drivers/mmc/core/debugfs.c | 90 +++
> drivers/mmc/core/host.c | 226 ++++++
> drivers/mmc/core/mmc.c | 246 ++++++-
> drivers/mmc/core/queue.c | 2 +
> drivers/mmc/core/sd.c | 84 ++-
> drivers/mmc/host/sdhci-msm.c | 2 +
> include/linux/mmc/card.h | 7 +
> include/linux/mmc/host.h | 66 ++
> 12 files changed, 1550 insertions(+), 5 deletions(-)

This is a lot of new code in the mmc core, which I would then need to
maintain, of course. I have to admit, I am a bit concerned about that,
so you have to convince me that there are good reasons for me to apply
this.

As I stated above, I think the approach looks quite questionable in
general. And even if you can share measurements showing that it improves the
behaviour, I suspect (without a deeper code review) that some of the
code better belongs in common block device layer.

Kind regards
Uffe

2019-10-29 19:47:16

by Rob Herring (Arm)

[permalink] [raw]
Subject: Re: [RFC 6/6] dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters

On Mon, Oct 21, 2019 at 07:59:37PM +0530, Ram Prakash Gupta wrote:
> Adding clk scaling dt parameters.
>
> Signed-off-by: Ram Prakash Gupta <[email protected]>
> ---
> Documentation/devicetree/bindings/mmc/sdhci-msm.txt | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/mmc/sdhci-msm.txt b/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
> index da4edb1..afaf88d 100644
> --- a/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
> +++ b/Documentation/devicetree/bindings/mmc/sdhci-msm.txt
> @@ -39,6 +39,21 @@ Required properties:
> "cal" - reference clock for RCLK delay calibration (optional)
> "sleep" - sleep clock for RCLK delay calibration (optional)
>
> +Optional properties:
> +- devfreq,freq-table - specifies supported frequencies for clock scaling.
> + Clock scaling logic shall toggle between these frequencies based
> + on card load. In case the defined frequencies are over or below
> + the supported card frequencies, they will be overridden
> + during card init. In case this entry is not supplied,
> + the driver will construct one based on the card
> + supported max and min frequencies.
> + The frequencies must be ordered from lowest to highest.

This should be common. Surely we already have something?

> +
> +- scaling-lower-bus-speed-mode - Few hosts can support DDR52 mode at the
> + same lower system voltage corner as high-speed mode. In such
> + cases, it is always better to put it in DDR mode which will
> + improve the performance without any power impact.

The description sounds like a boolean. Why the string? What are possible
values?

Also needs a 'qcom' vendor prefix.

> +
> Example:
>
> sdhc_1: sdhci@f9824900 {
> @@ -56,6 +71,10 @@ Example:
>
> clocks = <&gcc GCC_SDCC1_APPS_CLK>, <&gcc GCC_SDCC1_AHB_CLK>;
> clock-names = "core", "iface";
> +
> + devfreq,freq-table = <50000000 200000000>;
> + scaling-lower-bus-speed-mode = "DDR52";
> +
> };
>
> sdhc_2: sdhci@f98a4900 {
> --
> 1.9.1
>

2019-11-15 17:41:23

by Doug Anderson

[permalink] [raw]
Subject: Re: [RFC 0/6] mmc: Add clock scaling support for mmc driver

Hi,

On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta
<[email protected]> wrote:
>
> Each time the triggering point for scaling up/down is hit, then a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour, seems like
> very use case sensitive, right!?
>
> With clock scaling support device mode would be switched between low speed
> (hs50 or ddr52) and high speed mode(hs400 enhanced strobe).

I haven't read the patches, but just from this description it feels
like the wrong way to go. From my understanding if you're running at
HS400 then you can run the card at any speed between 0 and 200 MHz. I
see no reason why you'd want to switch modes. Just stay at HS400 and
slow down the clock, right? Then you can keep the "Enhanced Strobe"
which makes everything more reliable.

If you're running on a system that doesn't have enhanced strobe you
_should_ still be able to switch clock speeds without switching modes
since the spec has just a _maximum_ clock frequency for each mode, so
HS200, DDR50, etc should all be able to run at slower speeds without
an official mode switch. You also shouldn't have to re-tune. Tuning
is nothing magical, it's just trying to find the delay between sending
a pulse on the clock line and when you read the data sent by the other
side. Assuming your tuning delay is natively represented in "phase
offset", you can convert that back to an actual delay and then back to
"phase offset" for the new clock.

To further argue against switching modes, I would also note that for
SD cards switching to a slower speed mode may result in an increase in
IO voltage, which seems like it could be bad for power consumption?


> And from energy point of view, this feature is only helping in saving energy
> and not adding any energy penalty. And we have observed a considerable amount
> of power saving(data shared in mid) when playing 1080p video playback with
> clock scaling feature support.

I am slightly baffled about why this would save energy unless it
allows you to lower the voltage to the controller. I know you _can_
sometimes lower the voltage to the controller on Qualcomm parts, but
you are arguing that it is useful even on systems that can't lower the
voltage. That feels slightly questionable. I would expect that
racing to idle (with the right tuning parameters) would be a better
way to go.

As a specific example, let's imagine we want to transfer 1000 MB of
data and we have 20 seconds with which to do it. We can achieve this
by transferring 50 MB/s for the whole 20 seconds. Alternatively, we
could transfer at 400 MB/s 2.5 seconds and then fully power gate the
eMMC for the next 17.5 seconds.

In that example, I'd wonder which is more power efficient. Presumably
the 2nd. This is the whole "race to idle" concept as I understand it.

The "race to idle" is talked about a lot in the context of CPU
frequency decisions. Presumably you'll point out that "race to idle"
is NOT the right thing to do for choosing a CPU frequency. As I
understand it, this is primarily true because we need to raise the CPU
voltage to run at faster speeds. This would lead me to believe that
the only case you'd want to do frequency scaling like this is if it
allows you to lower the voltage provided to the eMMC controller. As
you said, for Qualcomm it _does_ allow you to do this, but most other
SoCs don't. ...so unless there's a flaw in my logic (always
possible!) this patch series should be amended to say it's only useful
if it allows you to down-volt the controller.

Just to think a little bit more about my logic: of course, you might
argue that we can't just do a 1000 MB data transfer. We can break
that down into two cases:

a) For writing, presumably the data is produced over time and you
don't want to buffer the whole 1000 MB and delay 17.5 seconds before
you start writing. ...but presumably you could do _some_ buffering
and then break things into chunks where you ungate the clock to the
card, write a chunk out, and then re-gate the clock. There will
obviously be some overhead with each clock gate/ungate, but
(hopefully) not too much. ...and there will be time when data is in
RAM and not on the disk so you'd worry about power failure, but if you
want to get data on the disk ASAP why are you scaling the clock (and
thus delaying the data from getting to the disk) at all? Maybe some
math? How long does it take to ungate/gate the clocks. 1 ms? It's
just flipping a bit, right? ...and does assuming we can allocate a 40
MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to
transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec
(w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either
keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
102 ms and gate it for 698 ms.

b) If you're reading data then hopefully the system has some sort of
readahead going on. In the "video playback" case the system should
have no problem predicting that if you've just read bytes
1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
you're likely to keep reading the same file. Presumably it wouldn't
be totally insane to read 40 MB ahead of time and then we can do the
same math as a). If 40 MB is too much for readahead, then shrink it
and redo the math. Even with much smaller numbers the "race to idle"
wins because gating / ungating clocks is fast. ...or do you know some
reason why gating / ungating clocks needs to be slow? If so, how
slow?


> Test case used are 1080p and 4k video playback use case. Please find below
> test case information and power impact data. In both the below video playback
> cases, a considerable amount of power savings can be observed with clock scaling
> feature.
>
> Use case                              Delta at battery (mA)   Power impact %
> 30 fps at HD 1080p decode, 20 Mbps    10 mA                   11%
> 30 fps at UHD 8b H.264, 42 Mbps       20.93 mA                19%

Numbers like this are exactly what is needed to justify your patch
series. ...but I'd be super curious to how it would compare to:

1) Tuning the runtime PM auto-suspend delay. If you have your
auto-suspend delay set wrong (like 500 ms) then all the math above is
totally wrong. We'll keep clocking at 400 MHz needlessly even though
there is no data to transfer. If autosuspend is just gating clocks
then it feels like you could set it to 1 ms, or 10 ms. NOTE: if
autosuspend for you is something more major (fully turning off power
rails to the eMMC) then maybe you need another level where you just
turn off the clocks. Seems like we could find some way to make that
work.

2) Tuning any readahead mechanism in your system. If your system
somehow does zero readahead then obviously all my arguments don't work
for reads. ...but why would you not have readahead?

3) Tuning any write buffering in your system. Same argument as #2.

4) Making sure that when the MMC card clock is gated that you request
the lowest voltage level for the controller (and set the controller's
bus clock to the lowest level since it's not doing anything).
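
Regarding 1) above, here's roughly the shape of what I mean -- just a
sketch, and the function name and the 10 ms number are made up for
illustration, not a recommendation:

#include <linux/pm_runtime.h>

/* e.g. called from the host driver's probe(), after enabling runtime PM */
static void example_setup_autosuspend(struct device *dev)
{
	/*
	 * Let runtime suspend (which gates the card clock) kick in
	 * shortly after the last request completes, instead of leaving
	 * the clock running.
	 */
	pm_runtime_set_autosuspend_delay(dev, 10);	/* in ms */
	pm_runtime_use_autosuspend(dev);
}

If runtime suspend is heavier than just gating clocks on your platform
then the delay needs to be bigger, or you need a separate clock-gating
level as noted above.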


I would also be very interested to know how much of those savings are
achieved if you keep the voltage to the MMC controller the same. In
other words do something that arbitrarily keeps the MMC controller
requesting the same voltage level from the rest of the system and then
do your power measurements. How much do your savings change?


I will also note that aggressive clock gating is exactly what the
dw_mmc controller does automatically (except for SDIO, but that's a
different story). Seeing that the controller itself can stop the
clock in dw_mmc gives further credence that gating / ungating the
clock is a very lightweight operation and 1 ms is probably an
over-estimation of how long it takes.


I guess one very last note is that I spent most of the time above
arguing that the clock scaling concept is probably not useful for any
SoCs where you can't adjust the voltage provided to the MMC
controller. That doesn't necessarily mean that your patch series is
useful for SoCs where you can. Specifically you'd need to do math to
see how much more power the MMC controller takes at the higher
voltage. Then you can calculate a "perf per watt". If the watts to
transfer data at 400 MB/s aren't 8 times more than the watts to
transfer at 50 MB/s then that's a ding against your idea. You also
don't have a dedicated voltage rail, right? So you'd have to see what
percentage of the time the MMC controller was the only thing in the
system that was requesting the higher voltage. If it happened that
something else in the system was keeping the voltage higher anyway
then it would be better for you to run faster and race to idle.
Really I guess the case you're most worried about is if the MMC
controller is causing other components in the system to be at a higher
voltage point (and thus take up more power)...


Hope that all makes sense. I can read the patches themselves if
needed--everything from above is just based on your cover letter and
discussion with Ulf. ;-)


-Doug

2019-11-29 07:39:06

by Ram Prakash Gupta

[permalink] [raw]
Subject: Re: [RFC 0/6] mmc: Add clock scaling support for mmc driver

Hi Ulf,

Seems there is some settings issue with my Thunderbird application.
Sorry for the spam, please ignore my last responses as unsupported
characters got added.

Typing my response again from the browser and re-sending.

Thanks,
Ram

On 2019-10-22 14:10, Ulf Hansson wrote:
> On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta
> <[email protected]> wrote:
>>
>> This change adds the use of devfreq based clock scaling to MMC.
>> This applicable for eMMC and SDCard.
>> For some workloads, such as video playback, it isn't necessary
>> for these cards to run at high speed. Running at lower
>> frequency, in such cases can still meet the deadlines for data
>> transfers.
>>
>> Scaling down the clock frequency dynamically has power savings
>> not only because the bus is running at lower frequency but also
>> has an advantage of scaling down the system core voltage, if
>> supported. Provide an ondemand clock scaling support similar
>> to the cpufreq ondemand governor having two thresholds,
>> up_threshold and down_threshold to decide whether to increase
>> the frequency or scale it down respectively as per load.
>
> This sounds simple, but what the series is doing is far more
> complicated than just scaling the bus clock, as it also re-negotiates the
> bus speed mode.
>
> Each time the triggering point for scaling up/down is hit, then a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour, seems like
> very use case sensitive, right!?

Switching modes would incur some latency for sending the commands to
switch modes, but tuning is not needed, as most of the eMMC devices used
nowadays ship with enhanced strobe support, so tuning would not add any
latency since it is not required in hs400 enhanced strobe mode.

This feature is implemented for the video playback case, where the data
transfer load is low and clock scaling helps reduce power consumption.

And when there is a burst of data transfer requests, the load will
remain _high_, so there won't be any switching and hence it won't affect
any existing use cases from a latency point of view.

Also, if the hw supports switching the clk frequency without changing
the mode, I will change the code accordingly. For this I have sought
input from the hw team.

From the collected data, I see this feature is helping in saving power
consumption, and no energy penalty is observed. Please share if I am
missing anything specific. The power saving with this feature is quite
considerable. Please find the data below.

Use case                              Delta at battery (mA)   Power impact %
30 fps at HD 1080p decode, 20 Mbps    10 mA                   11%
30 fps at UHD 8b H.264, 42 Mbps       20.93 mA                19%

>
> Overall, when it comes to use cases, we have very limited knowledge
> about them at the mmc block layer level. I think it should remain like
> that. If at any place at all, this information is better maintained by
> the generic block layer and potentially the configured I/O scheduler.

I think the generic block layer does not have knowledge of the use case
behind a data transfer request. And the devfreq framework has been used
to implement this feature, which would be the same in any layer.

Also, mobile platforms mostly come with eMMC and UFS as storage media,
and clock scaling is already implemented in the upstream UFS driver
using the devfreq framework. Along similar lines, this feature is
implemented for the mmc driver. So I believe the clk scaling
implementation is better placed in the mmc driver rather than in the
generic block layer.
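
Just to illustrate the shape of the devfreq hook-up, along the lines of
what the UFS driver does (this is only a rough sketch, not the exact
code in this series; the mmc_devfreq_* callback names are placeholders
and the threshold values are illustrative):

#include <linux/devfreq.h>
#include <linux/err.h>
#include <linux/mmc/host.h>

/* Placeholder callbacks: change the card clock / report busy vs. total time. */
static int mmc_devfreq_target(struct device *dev, unsigned long *freq, u32 flags);
static int mmc_devfreq_get_dev_status(struct device *dev,
				      struct devfreq_dev_status *stat);

static struct devfreq_simple_ondemand_data mmc_ondemand_data = {
	.upthreshold	  = 35,	/* scale up when load goes above 35% */
	.downdifferential = 5,	/* scale down when load drops below 30% */
};

static struct devfreq_dev_profile mmc_devfreq_profile = {
	.polling_ms	= 100,	/* load sampling interval */
	.target		= mmc_devfreq_target,
	.get_dev_status	= mmc_devfreq_get_dev_status,
};

static int mmc_clk_scaling_init_sketch(struct mmc_host *host)
{
	struct devfreq *df;

	df = devm_devfreq_add_device(mmc_dev(host), &mmc_devfreq_profile,
				     DEVFREQ_GOV_SIMPLE_ONDEMAND,
				     &mmc_ondemand_data);
	return PTR_ERR_OR_ZERO(df);
}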

>
> This brings me to a question about the tests you have run. Can you
> share some information and data about that?

Test cases used are 1080p and 4k video playback, as this feature is
implemented specifically for the video playback use case.
>
>>
>>
>> Ram Prakash Gupta (6):
>> mmc: core: Parse clk scaling dt entries
>> mmc: core: Add core scaling support in driver
>> mmc: core: Initialize clk scaling for mmc and SDCard
>> mmc: core: Add debugfs entries for scaling support
>> mmc: sdhci-msm: Add capability in platfrom host
>> dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>>
>> .../devicetree/bindings/mmc/sdhci-msm.txt | 19 +
>
> I noticed that the DT patch was put last in the series, but the
> parsing is implemented in the first patch. Please flip this around. If
> you want to implement DT parsing of new bindings, please make sure to
> discuss the new bindings first.

I will update in next post.

>
>> drivers/mmc/core/block.c | 19 +-
>> drivers/mmc/core/core.c | 777
>> +++++++++++++++++++++
>> drivers/mmc/core/core.h | 17 +
>> drivers/mmc/core/debugfs.c | 90 +++
>> drivers/mmc/core/host.c | 226 ++++++
>> drivers/mmc/core/mmc.c | 246 ++++++-
>> drivers/mmc/core/queue.c | 2 +
>> drivers/mmc/core/sd.c | 84 ++-
>> drivers/mmc/host/sdhci-msm.c | 2 +
>> include/linux/mmc/card.h | 7 +
>> include/linux/mmc/host.h | 66 ++
>> 12 files changed, 1550 insertions(+), 5 deletions(-)
>
> This is a lot of new code in the mmc core, which I would then need to
> maintain, of course. I have to admit, I am a bit concerned about that,
> so you have to convince me that there are good reasons for me to apply
> this.
>
> As I stated above, I think the approach looks quite questionable in
> general. And even if you can share measurements showing that it improves the
> behaviour, I suspect (without a deeper code review) that some of the
> code better belongs in common block device layer.

From the collected power data, I see this as a good reason to have this
feature in the mmc driver, as the numbers are quite considerable.

Regarding the approach, it would be helpful if you share your inputs.
And as you have stated, this can be further discussed after a review
from you.

>
> Kind regards
> Uffe

2019-11-29 08:11:36

by Ram Prakash Gupta

[permalink] [raw]
Subject: Re: [RFC 0/6] mmc: Add clock scaling support for mmc driver

Hi,

On 2019-11-15 23:09, Doug Anderson wrote:
> Hi,
>
> On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta
> <[email protected]> wrote:
>>
>> Each time the triggering point for scaling up/down is hit, then a
>> series of commands needs to be sent to the card, including running the
>> tuning procedure. The point is, for sure, this doesn't come for free,
>> both from a latency point of view, but also from an energy cost point
>> of view. So, whether this really improves the behaviour, seems like
>> very use case sensitive, right!?
>>
>> With clock scaling support device mode would be switched between low
>> speed
>> (hs50 or ddr52) and high speed mode(hs400 enhanced strobe).
>
> I haven't read the patches, but just from this description it feels
> like the wrong way to go. From my understanding if you're running at
> HS400 then you can run the card at any speed between 0 and 200 MHz. I
> see no reason why you'd want to switch modes. Just stay at HS400 and
> slow down the clock, right? Then you can keep the "Enhanced Strobe"
> which makes everything more reliable.
>
> If you're running on a system that doesn't have enhanced strobe you
> _should_ still be able to switch clock speeds without switching modes
> since the spec has just a _maximum_ clock frequency for each mode, so
> HS200, DDR50, etc should all be able to run at slower speeds without
> an official mode switch. You also shouldn't have to re-tune. Tuning
> is nothing magical, it's just trying to find the delay between sending
> a pulse on the clock line and when you read the data sent by the other
> side. Assuming your tuning delay is natively represented in "phase
> offset", you can convert that back to an actual delay and then back to
> "phase offset" for the new clock.

Thanks for the suggestion. I have sought input from the hardware team
on this and will get back on it.

>
> To further argue against switching modes, I would also note that for
> SD cards switching to a slower speed mode may result in an increase in
> IO voltage, which seems like it could be bad for power consumption?
>

For SDCard clk scaling, the driver is not switching any mode, so this
should not affect any use case.

>
>> And from energy point of view, this feature is only helping in saving
>> energy
>> and not adding any energy penalty. And we have observed a considerable
>> amount
>> of power saving(data shared in mid) when playing 1080p video playback
>> with
>> clock scaling feature support.
>
> I am slightly baffled about why this would save energy unless it
> allows you to lower the voltage to the controller. I know you _can_
> sometimes lower the voltage to the controller on Qualcomm parts, but
> you are arguing that it is useful even on systems that can't lower the
> voltage. That feels slightly questionable. I would expect that
> racing to idle (with the right tuning parameters) would be a better
> way to go.

Sorry if my explanation was misleading before. The MMC driver is not
changing the card/controller voltage, but lowering the clock frequency
of the card and controller brings down the _bus_ and _system_ voltage
corners of the device, which helps in saving power consumption.

>
> As a specific example, let's imagine we want to transfer 1000 MB of
> data and we have 20 seconds with which to do it. We can achieve this
> by transferring 50 MB/s for the whole 20 seconds. Alternatively, we
> could transfer at 400 MB/s 2.5 seconds and then fully power gate the
> eMMC for the next 17.5 seconds.
>
> In that example, I'd wonder which is more power efficient. Presumably
> the 2nd. This is the whole "race to idle" concept as I understand it.
>
> The "race to idle" is talked about a lot in the context of CPU
> frequency decisions. Presumably you'll point out that "race to idle"
> is NOT the right thing to do for choosing a CPU frequency. As I
> understand it, this is primarily true because we need to raise the CPU
> voltage to run at faster speeds. This would lead me to believe that
> the only case you'd want to do frequency scaling like this is if it
> allows you to lower the voltage provided to the eMMC controller. As
> you said, for Qualcomm it _does_ allow you to do this, but most other
> SoCs don't. ...so unless there's a flaw in my logic (always
> possible!) this patch series should be amended to say it's only useful
> if it allows you to down-volt the controller.
>
> Just to think a little bit more about my logic: of course, you might
> argue that we can't just do a 1000 MB data transfer. We can break
> that down into two cases:
>
> a) For writing, presumably the data is produced over time and you
> don't want to buffer the whole 1000 MB and delay 17.5 seconds before
> you start writing. ...but presumably you could do _some_ buffering
> and then break things into chunks where you ungate the clock to the
> card, write a chunk out, and then re-gate the clock. There will
> obviously be some overhead with each clock gate/ungate, but
> (hopefully) not too much. ...and there will be time when data is in
> RAM and not on the disk so you'd worry about power failure, but if you
> want to get data on the disk ASAP why are you scaling the clock (and
> thus delaying the data from getting to the disk) at all? Maybe some
> math? How long does it take to ungate/gate the clocks. 1 ms? It's
> just flipping a bit, right? ...and does assuming we can allocate a 40
> MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to
> transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec
> (w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either
> keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
> 102 ms and gate it for 698 ms.
>

"race to idle" helps but this feature was implemented with focus on
video
playback case, where data transfer request to mmc driver spans over
entire
playback time of video. In this case, running device in low speed mode
helps.

> b) If you're reading data then hopefully the system has some sort of
> readahead going on. In the "video playback" case the system should
> have no problem predicting that if you've just read bytes
> 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
> you're likely to keep reading the same file. Presumably it wouldn't
> be totally insane to read 40 MB ahead of time and then we can do the
> same math as a). If 40 MB is too much for readahead, then shrink it
> and redo the math. Even with much smaller numbers the "race to idle"
> wins because gating / ungating clocks is fast. ...or do you know some
> reason why gating / ungating clocks needs to be slow? If so, how
> slow?
>

I have performed one experiment by increasing the readahead size, but
that is not helping. And I don't observe much difference in the data
request pattern generated in the video playback case.

>
>> Test case used are 1080p and 4k video playback use case. Please find
>> below
>> test case information and power impact data. In both the below video
>> playback
>> cases, a considerable amount of power savings can be observed with
>> clock scaling
>> feature.
>>
>> Use case                              Delta at battery (mA)   Power impact %
>> 30 fps at HD 1080p decode, 20 Mbps    10 mA                   11%
>> 30 fps at UHD 8b H.264, 42 Mbps       20.93 mA                19%
>
> Numbers like this are exactly what is needed to justify your patch
> series. ...but I'd be super curious to how it would compare to:
>
> 1) Tuning the runtime PM auto-suspend delay. If you have your
> auto-suspend delay set wrong (like 500 ms) then all the math above is
> totally wrong. We'll keep clocking at 400 MHz needlessly even though
> there is no data to transfer. If autosuspend is just gating clocks
> then it feels like you could set it to 1 ms, or 10 ms. NOTE: if
> autosuspend for you is something more major (fully turning off power
> rails to the eMMC) then maybe you need another level where you just
> turn off the clocks. Seems like we could find some way to make that
> work.

Gating / ungating can be fine-tuned to help bring down power
consumption too. I will share power numbers with tuned parameters in
the next communication.

>
> 2) Tuning any readahead mechanism in your system. If your system
> somehow does zero readahead then obviously all my arguments don't work
> for reads. ...but why would you not have readahead?
>
> 3) Tuning any write buffering in your system. Same argument as #2.

This feature is specific to the video playback use case from the
storage device. I am not sure which buffering can be tuned; can you
point out any buffering used?

>
> 4) Making sure that when the MMC card clock is gated that you request
> the lowest voltage level for the controller (and set the controller's
> bus clock to the lowest level since it's not doing anything).
>
>
> I would also be very interested to know how much of those savings are
> achieved if you keep the voltage to the MMC controller the same. In
> other words do something that arbitrarily keeps the MMC controller
> requesting the same voltage level from the rest of the system and then
> do your power measurements. How much do your savings change?
>
>
> I will also note that aggressive clock gating is exactly what the
> dw_mmc controller does automatically (except for SDIO, but that's a
> different story). Seeing that the controller itself can stop the
> clock in dw_mmc gives further credence that gating / ungating the
> clock is a very lightweight operation and 1 ms is probably an
> over-estimation of how long it takes.
>
>
> I guess one very last note is that I spent most of the time above
> arguing that the clock scaling concept is probably not useful for any
> SoCs where you can't adjust the voltage provided to the MMC
> controller. That doesn't necessarily mean that your patch series is
> useful for SoCs where you can. Specifically you'd need to do math to
> see how much more power the MMC controller takes at the higher
> voltage. Then you can calculate a "perf per watt". If the watts to
> transfer data at 400 MB/s aren't 8 times more than the watts to
> transfer at 50 MB/s then that's a ding against your idea. You also
> don't have a dedicated voltage rail, right? So you'd have to see what
> percentage of the time the MMC controller was the only thing in the
> system that was requesting the higher voltage. If it happened that
> something else in the system was keeping the voltage higher anyway
> then it would be better for you to run faster and race to idle.
> Really I guess the case you're most worried about is if the MMC
> controller is causing other components in the system to be at a higher
> voltage point (and thus take up more power)...
>

The rail is not dedicated to the eMMC device, so during video playback
it is the mmc vote alone that keeps the device at higher voltage
corners. So for a three-hour video playback, the power consumption due
to the mmc vote would be quite considerable.

>
> Hope that all makes sense. I can read the patches themselves if
> needed--everything from above is just based on your cover letter and
> discussion with Ulf. ;-)
>
>
> -Doug

Thanks,
Ram

2019-12-04 02:10:51

by Doug Anderson

[permalink] [raw]
Subject: Re: [RFC 0/6] mmc: Add clock scaling support for mmc driver

Hi,

On Fri, Nov 29, 2019 at 12:10 AM <[email protected]> wrote:
>
> > I am slightly baffled about why this would save energy unless it
> > allows you to lower the voltage to the controller. I know you _can_
> > sometimes lower the voltage to the controller on Qualcomm parts, but
> > you are arguing that it is useful even on systems that can't lower the
> > voltage. That feels slightly questionable. I would expect that
> > racing to idle (with the right tuning parameters) would be a better
> > way to go.
>
> Sorry if my explanation was misleading before. The MMC driver is not
> changing the card/controller voltage, but lowering the clock frequency
> of the card and controller brings down the _bus_ and _system_ voltage
> corners of the device, which helps in saving power consumption.

Yes, I understood. You are able to lower the voltage supplied to the
controller itself, not to the card. ...but that is not something that
all SoCs can do. I think here you are arguing that your feature is
useful to everyone--even if they can't ever lower the voltage of the
controller. I am questioning that.


> > As a specific example, let's imagine we want to transfer 1000 MB of
> > data and we have 20 seconds with which to do it. We can achieve this
> > by transferring 50 MB/s for the whole 20 seconds. Alternatively, we
> > could transfer at 400 MB/s 2.5 seconds and then fully power gate the
> > eMMC for the next 17.5 seconds.
> >
> > In that example, I'd wonder which is more power efficient. Presumably
> > the 2nd. This is the whole "race to idle" concept as I understand it.
> >
> > The "race to idle" is talked about a lot in the context of CPU
> > frequency decisions. Presumably you'll point out that "race to idle"
> > is NOT the right thing to do for choosing a CPU frequency. As I
> > understand it, this is primarily true because we need to raise the CPU
> > voltage to run at faster speeds. This would lead me to believe that
> > the only case you'd want to do frequency scaling like this is if it
> > allows you to lower the voltage provided to the eMMC controller. As
> > you said, for Qualcomm it _does_ allow you to do this, but most other
> > SoCs don't. ...so unless there's a flaw in my logic (always
> > possible!) this patch series should be amended to say it's only useful
> > if it allows you to down-volt the controller.
> >
> > Just to think a little bit more about my logic: of course, you might
> > argue that we can't just do a 1000 MB data transfer. We can break
> > that down into two cases:
> >
> > a) For writing, presumably the data is produced over time and you
> > don't want to buffer the whole 1000 MB and delay 17.5 seconds before
> > you start writing. ...but presumably you could do _some_ buffering
> > and then break things into chunks where you ungate the clock to the
> > card, write a chunk out, and then re-gate the clock. There will
> > obviously be some overhead with each clock gate/ungate, but
> > (hopefully) not too much. ...and there will be time when data is in
> > RAM and not on the disk so you'd worry about power failure, but if you
> > want to get data on the disk ASAP why are you scaling the clock (and
> > thus delaying the data from getting to the disk) at all? Maybe some
> > math? How long does it take to ungate/gate the clocks. 1 ms? It's
> > just flipping a bit, right? ...and does assuming we can allocate a 40
> > MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to
> > transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec
> > (w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either
> > keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
> > 102 ms and gate it for 698 ms.
> >
>
> "race to idle" helps but this feature was implemented with focus on
> video
> playback case, where data transfer request to mmc driver spans over
> entire
> playback time of video. In this case, running device in low speed mode
> helps.

It doesn't matter if you've got a long playback. A properly tuned
"race to idle" should still be better unless you are seriously saving
a lot of power by never bringing the voltage up. The above example is
with a fixed size transfer, but just imagine the same thing over and
over again and you've got the video playback case.


> > b) If you're reading data then hopefully the system has some sort of
> > readahead going on. In the "video playback" case the system should
> > have no problem predicting that if you've just read bytes
> > 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
> > you're likely to keep reading the same file. Presumably it wouldn't
> > be totally insane to read 40 MB ahead of time and then we can do the
> > same math as a). If 40 MB is too much for readahead, then shrink it
> > and redo the math. Even with much smaller numbers the "race to idle"
> > wins because gating / ungating clocks is fast. ...or do you know some
> > reason why gating / ungating clocks needs to be slow? If so, how
> > slow?
> >
>
> I have performed one experiment by increasing the readahead size, but
> that is not helping. And I don't observe much difference in the data
> request pattern generated in the video playback case.

You have to make sure that you are proactively gating the card clock
in conjunction. If you aren't gating the card clock quickly enough
(and aren't lowering the controller voltage when the card clock is
gated) then you won't notice the savings.


> >> Test case used are 1080p and 4k video playback use case. Please find
> >> below
> >> test case information and power impact data. In both the below video
> >> playback
> >> cases, a considerable amount of power savings can be observed with
> >> clock scaling
> >> feature.
> >>
> >> Use case                              Delta at battery (mA)   Power impact %
> >> 30 fps at HD 1080p decode, 20 Mbps    10 mA                   11%
> >> 30 fps at UHD 8b H.264, 42 Mbps       20.93 mA                19%
> >
> > Numbers like this are exactly what is needed to justify your patch
> > series. ...but I'd be super curious to how it would compare to:
> >
> > 1) Tuning the runtime PM auto-suspend delay. If you have your
> > auto-suspend delay set wrong (like 500 ms) then all the math above is
> > totally wrong. We'll keep clocking at 400 MHz needlessly even though
> > there is no data to transfer. If autosuspend is just gating clocks
> > then it feels like you could set it to 1 ms, or 10 ms. NOTE: if
> > autosuspend for you is something more major (fully turning off power
> > rails to the eMMC) then maybe you need another level where you just
> > turn off the clocks. Seems like we could find some way to make that
> > work.
>
> Gating / ungating can be fine-tuned to help bring down power
> consumption too. I will share power numbers with tuned parameters in
> the next communication.

Thanks, I think this is critical for the discussion. Please make sure
that when you are gating you code it up in a way that you can remove
the vote for the higher controller voltage.
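
In sdhci-msm terms that could look roughly like the below (a sketch
only; sdhci_msm_drop_bus_vote() is a hypothetical helper standing in
for however the bus/voltage-corner vote is expressed on your platform):

static int sdhci_msm_runtime_suspend(struct device *dev)
{
	struct sdhci_host *host = dev_get_drvdata(dev);
	struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
	struct sdhci_msm_host *msm_host = sdhci_pltfm_priv(pltfm_host);

	/* Stop the bus clocks once the autosuspend delay expires... */
	clk_bulk_disable_unprepare(ARRAY_SIZE(msm_host->bulk_clks),
				   msm_host->bulk_clks);

	/*
	 * ...and drop whatever vote keeps the corner high, so that
	 * gating the clock actually lets the system voltage fall.
	 * (hypothetical helper, not an existing function)
	 */
	sdhci_msm_drop_bus_vote(msm_host);

	return 0;
}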


> > 2) Tuning any readahead mechanism in your system. If your system
> > somehow does zero readahead then obviously all my arguments don't work
> > for reads. ...but why would you not have readahead?
> >
> > 3) Tuning any write buffering in your system. Same argument as #2.
>
> This feature is specific to the video playback use case from the
> storage device. I am not sure which buffering can be tuned; can you
> point out any buffering used?

If you're only testing video playback then I guess you don't care so
much here. I was assuming you cared about video record too, but maybe
that's not such an important use case for saving power.


> > 4) Making sure that when the MMC card clock is gated that you request
> > the lowest voltage level for the controller (and set the controller's
> > bus clock to the lowest level since it's not doing anything).
> >
> >
> > I would also be very interested to know how much of those savings are
> > achieved if you keep the voltage to the MMC controller the same. In
> > other words do something that arbitrarily keeps the MMC controller
> > requesting the same voltage level from the rest of the system and then
> > do your power measurements. How much do your savings change?
> >
> >
> > I will also note that aggressive clock gating is exactly what the
> > dw_mmc controller does automatically (except for SDIO, but that's a
> > different story). Seeing that the controller itself can stop the
> > clock in dw_mmc gives further credence that gating / ungating the
> > clock is a very lightweight operation and 1 ms is probably an
> > over-estimation of how long it takes.
> >
> >
> > I guess one very last note is that I spent most of the time above
> > arguing that the clock scaling concept is probably not useful for any
> > SoCs where you can't adjust the voltage provided to the MMC
> > controller. That doesn't necessarily mean that your patch series is
> > useful for SoCs where you can. Specifically you'd need to do math to
> > see how much more power the MMC controller takes at the higher
> > voltage. Then you can calculate a "perf per watt". If the watts to
> > transfer data at 400 MB/s aren't 8 times more than the watts to
> > transfer at 50 MB/s then that's a ding against your idea. You'd also
> > don't have a dedicated voltage rail, right? So you'd have to see what
> > percentage of the time the MMC controller was the only thing in the
> > system that was requesting the higher voltage. If it happened that
> > something else in the system was keeping the voltage higher anyway
> > then it would be better for you to run faster and race to idle.
> > Really I guess the case you're most worried about is if the MMC
> > controller is causing other components in the system to be at a higher
> > voltage point (and thus take up more power)...
> >
>
> The rail is not dedicated to the eMMC device, so during video playback
> it is the mmc vote alone that keeps the device at higher voltage
> corners. So for a three-hour video playback, the power consumption due
> to the mmc vote would be quite considerable.

You should be able to remove your vote whenever the card clock is
gated. Maybe this is what you're missing?

-Doug