2012-05-03 14:24:06

by Venkatraman S

[permalink] [raw]
Subject: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

The standard eMMC (Embedded MultiMediaCard) specification executes
one request at a time. Even if a new request is more important than the
one in progress, the ongoing flash operation cannot be aborted.

New versions of the eMMC standard (4.41 and above) specify a feature
called High Priority Interrupt (HPI). This enables an ongoing transaction
to be aborted using a special command (the HPI command) so that the card is
ready to receive new commands immediately. The new request can then be
submitted to the card, and the interrupted command can optionally be resumed.

Some restrictions exist on when and how the command can be used. For example,
only write and write-like commands (such as ERASE) can be preempted, and the
urgent request must be a read.

In order to support this in software,
a) At the top level, policy decisions have to be made about what is
worth preempting for.
This implementation treats demand paging requests and swap
read requests as reads worth preempting an ongoing long write.
This is expected to improve responsiveness for smartphones
with multitasking capabilities; an example would be launching an email
application while a video capture session (which causes long writes)
is ongoing.
b) At the block layer, the higher priority request should be queued
ahead of the pending requests in the elevator.
c) At the MMC block and core level, transactions have to be executed to
enforce the rules of the MMC spec, making a reasonable tradeoff on whether
the ongoing command is really worth preempting. (For example, is it too
close to completion already?)
The current implementation uses fixed-time logic: if 10ms have
already elapsed since the first request was submitted, a new high
priority request will not cause an HPI, as the first request is
expected to finish soon. Further work is needed to tune this value
dynamically (maybe through sysfs) or to determine it automatically
based on average write times of previous requests.
d) At the lowest level (MMC host controllers), provide an interface
for transitioning ongoing transactions to an aborted state so that the
controller is ready to receive the next command.

More information about this feature can be found at
JEDEC Specification:-
http://www.jedec.org/standards-documents/docs/jesd84-a441
Presentation on eMMC4.41 features:-
http://www.jedec.org/sites/default/files/Victor_Tsai.pdf

Acknowledgements:-
In no particular order, thanks to Arnd Bergmann and Saugata Das
from Linaro, Ilan Smith and Alex Lemberg from Sandisk, Luca Porzio from Micron
Technologies, Yejin Moon, Jae Hoon Chung from Samsung and others.

----
v1 -> v2:-
* Convert threshold for hpi usage to a tuning parameter (sysfs)
* Add Documentation/ABI for all sysfs entries
* Add implementation of abort for OMAP controller
* Rebased to 3.4-rc4 + mmc-next

This patch series depends on a few other related cleanups
in MMC driver. All the patches and the dependent series can be
pulled from
git://github.com/svenkatr/linux.git my/mmc/3.4/foreground-hpiv2

----------------------------------------------------------------
Ilan Smith (3):
FS: Added demand paging markers to filesystem
MM: Added page swapping markers to memory management
block: treat DMPG and SWAPIN requests as special

Venkatraman S (13):
block: add queue attributes to manage dmpg and swapin requests
block: add sysfs attributes for runtime control of dmpg and swapin
block: Documentation: add sysfs ABI for expedite_dmpg and expedite_swapin
mmc: core: helper function for finding preemptible command
mmc: core: add preemptibility tracking fields to mmc command
mmc: core: Add MMC abort interface
mmc: block: Detect HPI support in card and host controller
mmc: core: Implement foreground request preemption procedure
mmc: sysfs: Add sysfs entry for tuning preempt_time_threshold
mmc: Documentation: Add sysfs ABI for hpi_time_threshold
mmc: block: Implement HPI invocation and handling logic.
mmc: Update preempted request with CORRECTLY_PRG_SECTORS_NUM info
mmc: omap_hsmmc: Implement abort_req host_ops

Documentation/ABI/testing/sysfs-block | 12 ++
Documentation/ABI/testing/sysfs-devices-mmc | 12 ++
block/blk-core.c | 18 +++
block/blk-sysfs.c | 16 +++
block/elevator.c | 14 ++-
drivers/mmc/card/block.c | 143 ++++++++++++++++++++--
drivers/mmc/card/queue.h | 1 +
drivers/mmc/core/core.c | 67 ++++++++++
drivers/mmc/core/mmc.c | 25 ++++
drivers/mmc/host/omap_hsmmc.c | 55 ++++++++-
fs/mpage.c | 2 +
include/linux/bio.h | 8 ++
include/linux/blk_types.h | 4 +
include/linux/blkdev.h | 8 ++
include/linux/mmc/card.h | 1 +
include/linux/mmc/core.h | 19 +++
include/linux/mmc/host.h | 1 +
include/linux/mmc/mmc.h | 4 +
mm/page_io.c | 3 +-
19 files changed, 395 insertions(+), 18 deletions(-)

--
1.7.10.rc2


2012-05-03 14:24:17

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 03/16] block: add queue attributes to manage dmpg and swapin requests

Add block queue properties to identify and manage demand paging
and swapin requests differently.

Signed-off-by: Ilan Smith <[email protected]>
Signed-off-by: Alex Lemberg <[email protected]>
Signed-off-by: Venkatraman S <[email protected]>
---
include/linux/blkdev.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2aa2466..e9187d4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -420,6 +420,8 @@ struct request_queue {
#define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */
#define QUEUE_FLAG_SECDISCARD 17 /* supports SECDISCARD */
#define QUEUE_FLAG_SAME_FORCE 18 /* force complete on same CPU */
+#define QUEUE_FLAG_EXP_DMPG 19 /* Expedite Demand paging requests */
+#define QUEUE_FLAG_EXP_SWAPIN 20 /* Expedite page swapping */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
@@ -502,6 +504,12 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
#define blk_queue_secdiscard(q) (blk_queue_discard(q) && \
test_bit(QUEUE_FLAG_SECDISCARD, &(q)->queue_flags))

+#define blk_queue_exp_dmpg(q) \
+ test_bit(QUEUE_FLAG_EXP_DMPG, &(q)->queue_flags)
+
+#define blk_queue_exp_swapin(q) \
+ test_bit(QUEUE_FLAG_EXP_SWAPIN, &(q)->queue_flags)
+
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
REQ_FAILFAST_DRIVER))
--
1.7.10.rc2

2012-05-03 14:24:28

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 05/16] block: Documentation: add sysfs ABI for expedite_dmpg and expedite_swapin

Add a description of the usage of expedite_dmpg and
expedite_swapin.

Signed-off-by: Venkatraman S <[email protected]>
---
Documentation/ABI/testing/sysfs-block | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index c1eb41c..0fb9fef 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -206,3 +206,15 @@ Description:
when a discarded area is read the discard_zeroes_data
parameter will be set to one. Otherwise it will be 0 and
the result of reading a discarded area is undefined.
+
+What: /sys/block/<disk>/queue/expedite_dmpg
+What: /sys/block/<disk>/queue/expedite_swapin
+Date: April 2012
+Contact: Venkatraman S <[email protected]>
+Description:
+ For latency improvements, some storage devices provide
+ a mechanism for servicing demand paging and swapin
+ requests with high priority. Setting these flags
+ to 1 causes requests marked with REQ_RW_DMPG or
+ REQ_RW_SWAPIN to be moved to the front of the
+ elevator queue.
--
1.7.10.rc2

2012-05-03 14:24:41

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 08/16] mmc: core: add preemptibility tracking fields to mmc command

Set a preemptibility attribute on MMC commands. This can
later be used by write (multi block), trim etc. to
evaluate whether an HPI is applicable.

Record the starting time of a command's execution so that a
decision can be made on whether it is too late to preempt it.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/core/core.c | 5 +++++
include/linux/mmc/core.h | 4 ++++
2 files changed, 9 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index c4cd6fb..b4152ca 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -258,6 +258,11 @@ static int __mmc_start_req(struct mmc_host *host, struct mmc_request *mrq)
complete(&mrq->completion);
return -ENOMEDIUM;
}
+ if (mmc_is_preemptible_command(mrq->cmd))
+ mrq->cmd->cmd_attr |= MMC_CMD_PREEMPTIBLE;
+ else
+ mrq->cmd->cmd_attr &= ~MMC_CMD_PREEMPTIBLE;
+ mrq->cmd->started_time = jiffies;
mmc_start_request(host, mrq);
return 0;
}
diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index 680e256..d86144e 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -76,6 +76,10 @@ struct mmc_command {
*/
#define mmc_cmd_type(cmd) ((cmd)->flags & MMC_CMD_MASK)

+ unsigned int cmd_attr; /* Runtime attributes of the command */
+#define MMC_CMD_PREEMPTIBLE BIT(0)
+#define MMC_CMD_PREEMPTED BIT(1)
+ unsigned long started_time;
unsigned int retries; /* max number of retries */
unsigned int error; /* command error */

--
1.7.10.rc2

2012-05-03 14:24:47

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 09/16] mmc: core: Add MMC abort interface

HPI (and possibly other) procedures require that an ongoing
mmc request issued to a controller be aborted in the middle
of a transaction. Define an abort interface so that
individual host controllers can safely abort a request, stop
the DMA, clean up their state machine, etc. The
implementation is controller dependent.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/core/core.c | 8 ++++++++
include/linux/mmc/host.h | 1 +
2 files changed, 9 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index b4152ca..3f0e927 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -328,6 +328,14 @@ static void mmc_post_req(struct mmc_host *host, struct mmc_request *mrq,
}
}

+static int mmc_abort_req(struct mmc_host *host, struct mmc_request *req)
+{
+ if (host->ops->abort_req)
+ return host->ops->abort_req(host, req);
+
+ return -ENOSYS;
+}
+
/**
* mmc_start_req - start a non-blocking request
* @host: MMC host to start command
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 0707d22..d700703 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -98,6 +98,7 @@ struct mmc_host_ops {
int err);
void (*pre_req)(struct mmc_host *host, struct mmc_request *req,
bool is_first_req);
+ int (*abort_req)(struct mmc_host *host, struct mmc_request *req);
void (*request)(struct mmc_host *host, struct mmc_request *req);
/*
* Avoid calling these three functions too often or in a "fast path",
--
1.7.10.rc2

2012-05-03 14:24:58

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 12/16] mmc: sysfs: Add sysfs entry for tuning preempt_time_threshold

When High Priority Interrupt (HPI) is enabled, ongoing requests
might be preempted. It is not worthwhile to preempt requests
that have already progressed in the underlying driver for some time.

The threshold of elapsed time after which HPI is not useful can
be tuned on a per-device basis, using the hpi_time_threshold
sysfs entry.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/core/mmc.c | 25 +++++++++++++++++++++++++
include/linux/mmc/card.h | 1 +
2 files changed, 26 insertions(+)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 54df5ad..b7dbea1 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -624,6 +624,30 @@ MMC_DEV_ATTR(enhanced_area_offset, "%llu\n",
card->ext_csd.enhanced_area_offset);
MMC_DEV_ATTR(enhanced_area_size, "%u\n", card->ext_csd.enhanced_area_size);

+static ssize_t mmc_hpi_threshold_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct mmc_card *card = mmc_dev_to_card(dev);
+
+ return sprintf(buf, "%u\n", card->preempt_time_threshold);
+}
+
+static ssize_t mmc_hpi_threshold_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long threshold;
+ struct mmc_card *card = mmc_dev_to_card(dev);
+
+ if (kstrtoul(buf, 0, &threshold))
+ return -EINVAL;
+ card->preempt_time_threshold = threshold;
+ return count;
+}
+
+DEVICE_ATTR(hpi_time_threshold, S_IRUGO | S_IWUSR, mmc_hpi_threshold_show,
+ mmc_hpi_threshold_store);
+
static struct attribute *mmc_std_attrs[] = {
&dev_attr_cid.attr,
&dev_attr_csd.attr,
@@ -638,6 +662,7 @@ static struct attribute *mmc_std_attrs[] = {
&dev_attr_serial.attr,
&dev_attr_enhanced_area_offset.attr,
&dev_attr_enhanced_area_size.attr,
+ &dev_attr_hpi_time_threshold.attr,
NULL,
};

diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index 629b823..2a0da29 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -245,6 +245,7 @@ struct mmc_card {
unsigned int erase_shift; /* if erase unit is power 2 */
unsigned int pref_erase; /* in sectors */
u8 erased_byte; /* value of erased bytes */
+ unsigned int preempt_time_threshold; /* ms for checking hpi usage */

u32 raw_cid[4]; /* raw card CID */
u32 raw_csd[4]; /* raw card CSD */
--
1.7.10.rc2

2012-05-03 14:25:01

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 14/16] mmc: block: Implement HPI invocation and handling logic.

Intercept commands which require high priority treatment.
If the ongoing command can be preempted according to the JEDEC
HPI definition and is not already too close to completion,
invoke HPI to abort the current request, then issue the high
priority request.
Otherwise, process the command normally.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/card/block.c | 131 +++++++++++++++++++++++++++++++++++++++++++---
drivers/mmc/card/queue.h | 1 +
2 files changed, 124 insertions(+), 8 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index 11833e4..3dd662b 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -1276,7 +1276,7 @@ static int mmc_blk_cmd_err(struct mmc_blk_data *md, struct mmc_card *card,
return ret;
}

-static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
+static int mmc_blk_execute_rw_rq(struct mmc_queue *mq, struct request *rqc)
{
struct mmc_blk_data *md = mq->data;
struct mmc_card *card = md->queue.card;
@@ -1285,22 +1285,31 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
enum mmc_blk_status status;
struct mmc_queue_req *mq_rq;
struct request *req;
- struct mmc_async_req *areq;
+ struct mmc_async_req *prev_req, *cur_req;

if (!rqc && !mq->mqrq_prev->req)
return 0;

+ mq->mqrq_interrupted = NULL;
+
do {
if (rqc) {
mmc_blk_rw_rq_prep(mq->mqrq_cur, card, 0, mq);
- areq = &mq->mqrq_cur->mmc_active;
- } else
- areq = NULL;
- areq = mmc_start_req(card->host, areq, (int *) &status);
- if (!areq)
+ cur_req = &mq->mqrq_cur->mmc_active;
+ } else {
+ cur_req = NULL;
+ }
+ prev_req = mmc_start_req(card->host, cur_req, (int *) &status);
+ if (!prev_req)
return 0;

- mq_rq = container_of(areq, struct mmc_queue_req, mmc_active);
+ if (cur_req &&
+ cur_req->mrq->cmd->cmd_attr & MMC_CMD_PREEMPTIBLE) {
+ mq->mqrq_interrupted = mq->mqrq_cur;
+ }
+
+ mq_rq = container_of(prev_req,
+ struct mmc_queue_req, mmc_active);
brq = &mq_rq->brq;
req = mq_rq->req;
type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
@@ -1406,6 +1415,112 @@ static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *rqc)
return 0;
}

+#define HPI_CHECK (REQ_RW_SWAPIN | REQ_RW_DMPG)
+
+static bool mmc_can_do_foreground_hpi(struct mmc_queue *mq,
+ struct request *req, unsigned int thpi)
+{
+ /*
+ * If some time has elapsed since the issuing of previous write
+ * command, or if the size of the request was too small, there's
+ * no point in preempting it. Check if it's worthwhile to preempt
+ */
+ int time_elapsed = jiffies_to_msecs(jiffies -
+ mq->mqrq_cur->mmc_active.mrq->cmd->started_time);
+
+ if (time_elapsed <= thpi)
+ return true;
+
+ return false;
+}
+
+/*
+ * When an HPI command has been issued to preempt a foreground
+ * request, the host controller still finishes that request;
+ * its completion has to be handled differently.
+ */
+static struct mmc_async_req *mmc_handle_aborted_request(struct mmc_queue *mq,
+ int hpi_err)
+{
+ struct mmc_async_req *areq;
+ struct mmc_request *mrq;
+ struct mmc_queue_req *mq_rq;
+ struct mmc_blk_data *md = mq->data;
+ struct request *req;
+
+ BUG_ON(!mq->mqrq_interrupted);
+
+ areq = &mq->mqrq_interrupted->mmc_active;
+ mrq = areq->mrq;
+
+ /* Error checking is TBD */
+ mq_rq = container_of(areq, struct mmc_queue_req, mmc_active);
+ req = mq_rq->req;
+ mmc_queue_bounce_post(mq_rq);
+
+ spin_lock_irq(&md->lock);
+ /*
+ * TODO. Do the error translation as done in
+ * blk_err_check() here and propagate
+ * the partial transfer status if applicable
+ */
+ __blk_end_request(req, -EIO, 0);
+ spin_unlock_irq(&md->lock);
+ return areq;
+}
+
+static int mmc_blk_issue_rw_rq(struct mmc_queue *mq, struct request *req)
+{
+ int ret;
+ struct mmc_blk_data *md = mq->data;
+ struct mmc_card *card = md->queue.card;
+ struct mmc_async_req *areq;
+
+ if (req && md->flags & MMC_HPI_SUPPORT) {
+ if (!((req->cmd_flags & HPI_CHECK) && mq->mqrq_interrupted))
+ goto no_preempt;
+ if (!mmc_can_do_foreground_hpi(mq, req,
+ card->preempt_time_threshold))
+ goto no_preempt;
+
+ pr_debug("Pre-empting ongoing request %pK\n",
+ mq->mqrq_interrupted);
+
+ ret = mmc_preempt_foreground_request(card,
+ mq->mqrq_interrupted->mmc_active.mrq);
+ if (ret)
+ /*
+ * Couldn't execute HPI, or the request could
+ * have been completed already. Go through
+ * the normal route
+ */
+ goto no_preempt;
+
+ areq = mmc_handle_aborted_request(mq, ret);
+ /*
+ * Remove the request from the host controller's
+ * request queue. This prevents normal error handling
+ * and retry procedures from executing (we know the
+ * request has been aborted anyway). This also helps to start
+ * the urgent requests without doing the post processing
+ * of the aborted request
+ */
+ card->host->areq = NULL;
+ /*
+ * Now the decks are clear to send the most urgent command.
+ * As we've preempted the ongoing one already, the urgent
+ * one can go through the normal queue and it won't face much
+ * resistance - hence the intentional fall through
+ */
+ BUG_ON(areq != &mq->mqrq_interrupted->mmc_active);
+
+ }
+no_preempt:
+ ret = mmc_blk_execute_rw_rq(mq, req);
+ return ret;
+}
+
static int mmc_blk_issue_rq(struct mmc_queue *mq, struct request *req)
{
int ret;
diff --git a/drivers/mmc/card/queue.h b/drivers/mmc/card/queue.h
index d2a1eb4..7bd599e 100644
--- a/drivers/mmc/card/queue.h
+++ b/drivers/mmc/card/queue.h
@@ -33,6 +33,7 @@ struct mmc_queue {
struct mmc_queue_req mqrq[2];
struct mmc_queue_req *mqrq_cur;
struct mmc_queue_req *mqrq_prev;
+ struct mmc_queue_req *mqrq_interrupted;
};

extern int mmc_init_queue(struct mmc_queue *, struct mmc_card *, spinlock_t *,
--
1.7.10.rc2

2012-05-03 14:25:14

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 15/16] mmc: Update preempted request with CORRECTLY_PRG_SECTORS_NUM info

Ongoing request that was preempted during 'programming' state is partially
completed. Number of correctly programmed sectors is available in the
ext_csd field CORRECTLY_PRG_SECTORS_NUM. Read this field to update
the bytes_xfered field of the request

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/core/core.c | 26 ++++++++++++++++++++++++--
include/linux/mmc/mmc.h | 4 ++++
2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index e6430f8..354dd7a 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -122,6 +122,23 @@ static inline void mmc_should_fail_request(struct mmc_host *host,

#endif /* CONFIG_FAIL_MMC_REQUEST */

+static int mmc_get_programmed_sectors(struct mmc_card *card, int *nsectors)
+{
+ int err;
+ u8 ext_csd[512];
+
+ mmc_claim_host(card->host);
+ err = mmc_send_ext_csd(card, ext_csd);
+ mmc_release_host(card->host);
+ if (err)
+ return err;
+ *nsectors = ext_csd[EXT_CSD_C_PRG_SECTORS_NUM0] +
+ (ext_csd[EXT_CSD_C_PRG_SECTORS_NUM1] << 8) +
+ (ext_csd[EXT_CSD_C_PRG_SECTORS_NUM2] << 16) +
+ (ext_csd[EXT_CSD_C_PRG_SECTORS_NUM3] << 24);
+
+ return 0;
+}
/**
* mmc_request_done - finish processing an MMC request
* @host: MMC host which completed request
@@ -470,6 +487,7 @@ int mmc_preempt_foreground_request(struct mmc_card *card,
struct mmc_request *req)
{
int ret;
+ int nsectors;

ret = mmc_abort_req(card->host, req);
if (ret == -ENOSYS)
@@ -490,8 +508,12 @@ int mmc_preempt_foreground_request(struct mmc_card *card,
*/
if (req->data && req->data->error) {
mmc_interrupt_hpi(card);
- /* TODO : Take out the CORRECTLY_PRG_SECTORS_NUM
- * from ext_csd and add it to the request */
+
+ ret = mmc_get_programmed_sectors(card, &nsectors);
+ if (ret)
+ req->data->bytes_xfered = 0;
+ else
+ req->data->bytes_xfered = nsectors * 512;
}

return 0;
diff --git a/include/linux/mmc/mmc.h b/include/linux/mmc/mmc.h
index ec2f195..4a3453f 100644
--- a/include/linux/mmc/mmc.h
+++ b/include/linux/mmc/mmc.h
@@ -315,6 +315,10 @@ struct _mmc_csd {
#define EXT_CSD_PWR_CL_200_360 237 /* RO */
#define EXT_CSD_PWR_CL_DDR_52_195 238 /* RO */
#define EXT_CSD_PWR_CL_DDR_52_360 239 /* RO */
+#define EXT_CSD_C_PRG_SECTORS_NUM0 242 /* RO */
+#define EXT_CSD_C_PRG_SECTORS_NUM1 243 /* RO */
+#define EXT_CSD_C_PRG_SECTORS_NUM2 244 /* RO */
+#define EXT_CSD_C_PRG_SECTORS_NUM3 245 /* RO */
#define EXT_CSD_POWER_OFF_LONG_TIME 247 /* RO */
#define EXT_CSD_GENERIC_CMD6_TIME 248 /* RO */
#define EXT_CSD_CACHE_SIZE 249 /* RO, 4 bytes */
--
1.7.10.rc2

2012-05-03 14:25:47

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 16/16] mmc: omap_hsmmc: Implement abort_req host_ops

Provide the abort_req implementation for omap_hsmmc host.

When invoked, the host controller should stop the transfer
and end the ongoing request as early as possible.

If the aborted command is a data transfer command, the DMA setup is
aborted and a STOP command is issued. The transfer state is
marked as an error (except when the command has almost completed
by the time the abort request is received, in which case the command
finishes normally).

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/host/omap_hsmmc.c | 55 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/mmc/host/omap_hsmmc.c b/drivers/mmc/host/omap_hsmmc.c
index d15b149..a4da478 100644
--- a/drivers/mmc/host/omap_hsmmc.c
+++ b/drivers/mmc/host/omap_hsmmc.c
@@ -177,6 +177,7 @@ struct omap_hsmmc_host {
int reqs_blocked;
int use_reg;
int req_in_progress;
+ int abort_in_progress;
unsigned int flags;
struct omap_hsmmc_next next_data;

@@ -982,6 +983,7 @@ static inline void omap_hsmmc_reset_controller_fsm(struct omap_hsmmc_host *host,
static void omap_hsmmc_do_irq(struct omap_hsmmc_host *host, int status)
{
struct mmc_data *data;
+ int err = 0;
int end_cmd = 0, end_trans = 0;

if (!host->req_in_progress) {
@@ -993,6 +995,11 @@ static void omap_hsmmc_do_irq(struct omap_hsmmc_host *host, int status)
return;
}

+ if (host->abort_in_progress) {
+ end_trans = 1;
+ end_cmd = 1;
+ }
+
data = host->data;
dev_dbg(mmc_dev(host->mmc), "IRQ Status is %x\n", status);

@@ -1021,7 +1028,7 @@ static void omap_hsmmc_do_irq(struct omap_hsmmc_host *host, int status)
if ((status & DATA_TIMEOUT) ||
(status & DATA_CRC)) {
if (host->data || host->response_busy) {
- int err = (status & DATA_TIMEOUT) ?
+ err = (status & DATA_TIMEOUT) ?
-ETIMEDOUT : -EILSEQ;

if (host->data)
@@ -1045,10 +1052,13 @@ static void omap_hsmmc_do_irq(struct omap_hsmmc_host *host, int status)

OMAP_HSMMC_WRITE(host->base, STAT, status);

- if (end_cmd || ((status & CC) && host->cmd))
+ if ((end_cmd || (status & CC)) && host->cmd)
omap_hsmmc_cmd_done(host, host->cmd);
- if ((end_trans || (status & TC)) && host->mrq)
+ if ((end_trans || (status & TC)) && host->mrq) {
+ if (data)
+ data->error = err;
omap_hsmmc_xfer_done(host, data);
+ }
}

/*
@@ -1257,7 +1267,7 @@ static void omap_hsmmc_dma_cb(int lch, u16 ch_status, void *cb_data)
}

spin_lock_irqsave(&host->irq_lock, flags);
- if (host->dma_ch < 0) {
+ if (host->dma_ch < 0 || host->abort_in_progress) {
spin_unlock_irqrestore(&host->irq_lock, flags);
return;
}
@@ -1478,6 +1488,40 @@ static void omap_hsmmc_pre_req(struct mmc_host *mmc, struct mmc_request *mrq,
mrq->data->host_cookie = 0;
}

+static int omap_hsmmc_abort_req(struct mmc_host *mmc, struct mmc_request *req)
+{
+ struct omap_hsmmc_host *host = mmc_priv(mmc);
+ unsigned long flags;
+
+ if (!host->req_in_progress) {
+ dev_dbg(mmc_dev(host->mmc), "No request to abort\n");
+ return -EINVAL;
+ }
+ if (req && req != host->mrq) {
+ dev_dbg(mmc_dev(host->mmc), "Non matching abort request\n");
+ return -EINVAL;
+ }
+ spin_lock_irqsave(&host->irq_lock, flags);
+ host->abort_in_progress = 1;
+ omap_hsmmc_disable_irq(host);
+ spin_unlock_irqrestore(&host->irq_lock, flags);
+
+ host->response_busy = 0;
+
+ if (host->data) {
+ struct mmc_data *dat = host->data;
+ omap_hsmmc_dma_cleanup(host, -EIO);
+ dev_dbg(mmc_dev(host->mmc), "Aborting Transfer\n");
+ omap_hsmmc_xfer_done(host, dat);
+ } else if (host->cmd) {
+ dev_dbg(mmc_dev(host->mmc), "Aborting Command\n");
+ omap_hsmmc_cmd_done(host, host->cmd);
+ }
+
+ dev_dbg(mmc_dev(host->mmc), "Request %pK aborted\n", req);
+ return 0;
+}
/*
* Request function. for read/write operation
*/
@@ -1488,6 +1532,8 @@ static void omap_hsmmc_request(struct mmc_host *mmc, struct mmc_request *req)

BUG_ON(host->req_in_progress);
BUG_ON(host->dma_ch != -1);
+ host->abort_in_progress = 0;
+
if (host->protect_card) {
if (host->reqs_blocked < 3) {
/*
@@ -1664,6 +1710,7 @@ static const struct mmc_host_ops omap_hsmmc_ops = {
.disable = omap_hsmmc_disable_fclk,
.post_req = omap_hsmmc_post_req,
.pre_req = omap_hsmmc_pre_req,
+ .abort_req = omap_hsmmc_abort_req,
.request = omap_hsmmc_request,
.set_ios = omap_hsmmc_set_ios,
.get_cd = omap_hsmmc_get_cd,
--
1.7.10.rc2

2012-05-03 14:28:15

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 13/16] mmc: Documentation: Add sysfs ABI for hpi_time_threshold

hpi_time_threshold can be set to configure the elapsed time in ms
after which an ongoing request will not be preempted.
Explain the hpi_time_threshold parameter for MMC devices.

Signed-off-by: Venkatraman S <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-mmc | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-mmc b/Documentation/ABI/testing/sysfs-devices-mmc
index 5a50ab6..133dba5 100644
--- a/Documentation/ABI/testing/sysfs-devices-mmc
+++ b/Documentation/ABI/testing/sysfs-devices-mmc
@@ -19,3 +19,15 @@ Description:
is enabled, this attribute will indicate the size of enhanced
data area. If not, this attribute will be -EINVAL.
Unit KByte. Format decimal.
+
+What: /sys/devices/.../mmc_host/mmcX/mmcX:XXXX/hpi_time_threshold
+Date: April 2012
+Contact: Venkatraman S <[email protected]>
+Description:
+ High Priority Interrupt (HPI) is a feature defined in the
+ eMMC 4.41 standard. If this feature is enabled, the stack
+ needs to decide for how long after issue a request is still
+ considered preemptible. This attribute value (in milliseconds)
+ is used to arrive at the most suitable value for a specific
+ card. The default is zero, which also disables the feature,
+ as requests become non-preemptible immediately.
--
1.7.10.rc2

2012-05-03 14:29:00

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 10/16] mmc: block: Detect HPI support in card and host controller

If both the card and host controller support HPI related
operations, set a flag in MMC queue to remember it.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/card/block.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
index dabec55..11833e4 100644
--- a/drivers/mmc/card/block.c
+++ b/drivers/mmc/card/block.c
@@ -88,6 +88,7 @@ struct mmc_blk_data {
unsigned int flags;
#define MMC_BLK_CMD23 (1 << 0) /* Can do SET_BLOCK_COUNT for multiblock */
#define MMC_BLK_REL_WR (1 << 1) /* MMC Reliable write support */
+#define MMC_HPI_SUPPORT (1 << 2)

unsigned int usage;
unsigned int read_only;
@@ -1548,12 +1549,15 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
md->flags |= MMC_BLK_CMD23;
}

- if (mmc_card_mmc(card) &&
- md->flags & MMC_BLK_CMD23 &&
+ if (mmc_card_mmc(card)) {
+ if (md->flags & MMC_BLK_CMD23 &&
((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
card->ext_csd.rel_sectors)) {
- md->flags |= MMC_BLK_REL_WR;
- blk_queue_flush(md->queue.queue, REQ_FLUSH | REQ_FUA);
+ md->flags |= MMC_BLK_REL_WR;
+ blk_queue_flush(md->queue.queue, REQ_FLUSH | REQ_FUA);
+ }
+ if (card->host->ops->abort_req && card->ext_csd.hpi_en)
+ md->flags |= MMC_HPI_SUPPORT;
}

return md;
--
1.7.10.rc2

2012-05-03 14:29:29

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 11/16] mmc: core: Implement foreground request preemption procedure

When invoked, the ongoing command at the host controller should
be aborted and its completion should be invoked.

It's quite possible that the interruption will race with the
successful completion of the command. If so, HPI is invoked
only when the low level driver sets an error flag for the
aborted request.

Signed-off-by: Venkatraman S <[email protected]>
---
drivers/mmc/core/core.c | 32 ++++++++++++++++++++++++++++++++
include/linux/mmc/core.h | 2 ++
2 files changed, 34 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 3f0e927..e6430f8 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -466,6 +466,38 @@ out:
}
EXPORT_SYMBOL(mmc_interrupt_hpi);

+int mmc_preempt_foreground_request(struct mmc_card *card,
+ struct mmc_request *req)
+{
+ int ret;
+
+ ret = mmc_abort_req(card->host, req);
+ if (ret == -ENOSYS)
+ return ret;
+ /*
+ * Whether or not abort was successful, the command is
+ * still under the host controller's context.
+ * Should wait for the completion to be returned.
+ */
+ wait_for_completion(&req->completion);
+ /*
+ * Checkpoint the aborted request.
+ * If error is set, the request completed partially,
+ * and the ext_csd field "CORRECTLY_PRG_SECTORS_NUM"
+ * contains the number of blocks written to the device.
+ * If error is not set, the request was completed
+ * successfully and there is no need to try it again.
+ */
+ if (req->data && req->data->error) {
+ mmc_interrupt_hpi(card);
+ /* TODO : Take out the CORRECTLY_PRG_SECTORS_NUM
+ * from ext_csd and add it to the request */
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(mmc_preempt_foreground_request);
+
/**
* mmc_wait_for_cmd - start a command and wait for completion
* @host: MMC host to start command
diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index d86144e..e2d55c6 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -144,6 +144,8 @@ extern struct mmc_async_req *mmc_start_req(struct mmc_host *,
extern int mmc_interrupt_hpi(struct mmc_card *);
extern void mmc_wait_for_req(struct mmc_host *, struct mmc_request *);
extern int mmc_wait_for_cmd(struct mmc_host *, struct mmc_command *, int);
+extern int mmc_preempt_foreground_request(struct mmc_card *card,
+ struct mmc_request *req);
extern int mmc_app_cmd(struct mmc_host *, struct mmc_card *);
extern int mmc_wait_for_app_cmd(struct mmc_host *, struct mmc_card *,
struct mmc_command *, int);
--
1.7.10.rc2

2012-05-03 14:24:39

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 07/16] mmc: core: helper function for finding preemptible command

According to Table 30 in the eMMC spec, only some commands
can be preempted by a foreground HPI. Provide a helper function
for the HPI procedure to identify whether a command is
preemptible.

Signed-off-by: Venkatraman S <[email protected]>
---
include/linux/mmc/core.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/include/linux/mmc/core.h b/include/linux/mmc/core.h
index 1b431c7..680e256 100644
--- a/include/linux/mmc/core.h
+++ b/include/linux/mmc/core.h
@@ -10,6 +10,7 @@

#include <linux/interrupt.h>
#include <linux/completion.h>
+#include <linux/mmc/mmc.h>

struct request;
struct mmc_data;
@@ -192,6 +193,18 @@ static inline void mmc_claim_host(struct mmc_host *host)
__mmc_claim_host(host, NULL);
}

+static inline bool mmc_is_preemptible_command(struct mmc_command *cmd)
+{
+ if ((cmd->opcode == MMC_SWITCH && (cmd->arg == EXT_CSD_BKOPS_START ||
+ cmd->arg == EXT_CSD_SANITIZE_START ||
+ cmd->arg == EXT_CSD_FLUSH_CACHE))
+ || (cmd->opcode == MMC_ERASE)
+ || (cmd->opcode == MMC_WRITE_MULTIPLE_BLOCK)
+ || (cmd->opcode == MMC_WRITE_BLOCK))
+ return true;
+ return false;
+}
+
extern u32 mmc_vddrange_to_ocrmask(int vdd_min, int vdd_max);

#endif /* LINUX_MMC_CORE_H */
--
1.7.10.rc2

2012-05-03 14:24:36

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 04/16] block: add sysfs attributes for runtime control of dmpg and swapin

Add sysfs entries for the DMPG and SWAPIN expedite flags so that they
can be set/reset from userspace.

Signed-off-by: Venkatraman S <[email protected]>
---
block/blk-sysfs.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index cf15001..764de9f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -213,6 +213,8 @@ queue_store_##name(struct request_queue *q, const char *page, size_t count) \
}

QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1);
+QUEUE_SYSFS_BIT_FNS(expedite_dmpg, EXP_DMPG, 0);
+QUEUE_SYSFS_BIT_FNS(expedite_swapin, EXP_SWAPIN, 0);
QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
#undef QUEUE_SYSFS_BIT_FNS
@@ -387,6 +389,18 @@ static struct queue_sysfs_entry queue_random_entry = {
.store = queue_store_random,
};

+static struct queue_sysfs_entry queue_dmpg_entry = {
+ .attr = {.name = "expedite_demandpaging", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_show_expedite_dmpg,
+ .store = queue_store_expedite_dmpg,
+};
+
+static struct queue_sysfs_entry queue_swapin_entry = {
+ .attr = {.name = "expedite_swapping", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_show_expedite_swapin,
+ .store = queue_store_expedite_swapin,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -409,6 +423,8 @@ static struct attribute *default_attrs[] = {
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
&queue_random_entry.attr,
+ &queue_dmpg_entry.attr,
+ &queue_swapin_entry.attr,
NULL,
};

--
1.7.10.rc2

2012-05-03 14:30:32

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 06/16] block: treat DMPG and SWAPIN requests as special

From: Ilan Smith <[email protected]>

When exp_swapin and exp_dmpg are set, treat read requests
marked with DMPG and SWAPIN as high priority and move to
the front of the queue.

Signed-off-by: Ilan Smith <[email protected]>
Signed-off-by: Alex Lemberg <[email protected]>
Signed-off-by: Venkatraman S <[email protected]>
---
block/blk-core.c | 18 ++++++++++++++++++
block/elevator.c | 14 +++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1f61b74..7a1b98b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1306,6 +1306,12 @@ void init_request_from_bio(struct request *req, struct bio *bio)
if (bio->bi_rw & REQ_RAHEAD)
req->cmd_flags |= REQ_FAILFAST_MASK;

+ if (bio_swapin(bio) && blk_queue_exp_swapin(req->q))
+ req->cmd_flags |= REQ_RW_SWAPIN | REQ_NOMERGE;
+
+ if (bio_dmpg(bio) && blk_queue_exp_dmpg(req->q))
+ req->cmd_flags |= REQ_RW_DMPG | REQ_NOMERGE;
+
req->errors = 0;
req->__sector = bio->bi_sector;
req->ioprio = bio_prio(bio);
@@ -1333,6 +1339,18 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
goto get_rq;
}

+ if (bio_swapin(bio) && blk_queue_exp_swapin(q)) {
+ spin_lock_irq(q->queue_lock);
+ where = ELEVATOR_INSERT_FLUSH;
+ goto get_rq;
+ }
+
+ if (bio_dmpg(bio) && blk_queue_exp_dmpg(q)) {
+ spin_lock_irq(q->queue_lock);
+ where = ELEVATOR_INSERT_FLUSH;
+ goto get_rq;
+ }
+
/*
* Check if we can merge with the plugged list before grabbing
* any locks.
diff --git a/block/elevator.c b/block/elevator.c
index f016855..76d571b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -367,7 +367,8 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
q->nr_sorted--;

boundary = q->end_sector;
- stop_flags = REQ_SOFTBARRIER | REQ_STARTED;
+ stop_flags = REQ_SOFTBARRIER | REQ_STARTED
+ | REQ_RW_SWAPIN | REQ_RW_DMPG;
list_for_each_prev(entry, &q->queue_head) {
struct request *pos = list_entry_rq(entry);

@@ -585,6 +586,17 @@ void elv_quiesce_end(struct request_queue *q)

void __elv_add_request(struct request_queue *q, struct request *rq, int where)
{
+ unsigned int hpi_flags = REQ_RW_DMPG | REQ_RW_SWAPIN;
+
+ if (rq->cmd_flags & hpi_flags) {
+ /*
+ * Insert swap-in or demand paging requests at the front. This
+ * causes them to be queued in reverse order.
+ */
+ where = ELEVATOR_INSERT_FRONT;
+ }
+
trace_block_rq_insert(q, rq);

rq->q = q;
--
1.7.10.rc2

2012-05-03 14:24:13

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 02/16] MM: Added page swapping markers to memory management

From: Ilan Smith <[email protected]>

Add attribute to identify swapin requests
Mark memory management requests with swapin requests

Signed-off-by: Ilan Smith <[email protected]>
Signed-off-by: Alex Lemberg <[email protected]>
Signed-off-by: Venkatraman S <[email protected]>
---
include/linux/bio.h | 1 +
include/linux/blk_types.h | 2 ++
mm/page_io.c | 3 ++-
3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 264e0ef..8494b2f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,6 +63,7 @@ static inline bool bio_rw_flagged(struct bio *bio, unsigned long flag)
}

#define bio_dmpg(bio) bio_rw_flagged(bio, REQ_RW_DMPG)
+#define bio_swapin(bio) bio_rw_flagged(bio, REQ_RW_SWAPIN)

/*
* various member access, note that bio_data should of course not be used
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 87feb80..df2b9ea 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -151,6 +151,7 @@ enum rq_flag_bits {
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
__REQ_RW_DMPG,
+ __REQ_RW_SWAPIN,
__REQ_NR_BITS, /* stops here */
};

@@ -193,5 +194,6 @@ enum rq_flag_bits {
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
#define REQ_RW_DMPG (1 << __REQ_RW_DMPG)
+#define REQ_RW_SWAPIN (1 << __REQ_RW_SWAPIN)

#endif /* __LINUX_BLK_TYPES_H */
diff --git a/mm/page_io.c b/mm/page_io.c
index dc76b4d..a148bea 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -128,8 +128,9 @@ int swap_readpage(struct page *page)
ret = -ENOMEM;
goto out;
}
+ bio->bi_rw |= REQ_RW_SWAPIN;
count_vm_event(PSWPIN);
- submit_bio(READ, bio);
+ submit_bio(READ | REQ_RW_SWAPIN, bio);
out:
return ret;
}
--
1.7.10.rc2

2012-05-03 14:31:33

by Venkatraman S

[permalink] [raw]
Subject: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

From: Ilan Smith <[email protected]>

Add attribute to identify demand paging requests.
Mark readpages with demand paging attribute.

Signed-off-by: Ilan Smith <[email protected]>
Signed-off-by: Alex Lemberg <[email protected]>
Signed-off-by: Venkatraman S <[email protected]>
---
fs/mpage.c | 2 ++
include/linux/bio.h | 7 +++++++
include/linux/blk_types.h | 2 ++
3 files changed, 11 insertions(+)

diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..8b144f5 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -386,6 +386,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
+ if (bio)
+ bio->bi_rw |= REQ_RW_DMPG;
}
page_cache_release(page);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 4d94eb8..264e0ef 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -57,6 +57,13 @@
(bio)->bi_rw |= ((unsigned long) (prio) << BIO_PRIO_SHIFT); \
} while (0)

+static inline bool bio_rw_flagged(struct bio *bio, unsigned long flag)
+{
+ return ((bio->bi_rw & flag) != 0);
+}
+
+#define bio_dmpg(bio) bio_rw_flagged(bio, REQ_RW_DMPG)
+
/*
* various member access, note that bio_data should of course not be used
* on highmem page vectors
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..87feb80 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -150,6 +150,7 @@ enum rq_flag_bits {
__REQ_FLUSH_SEQ, /* request for flush sequence */
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
+ __REQ_RW_DMPG,
__REQ_NR_BITS, /* stops here */
};

@@ -191,5 +192,6 @@ enum rq_flag_bits {
#define REQ_IO_STAT (1 << __REQ_IO_STAT)
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
+#define REQ_RW_DMPG (1 << __REQ_RW_DMPG)

#endif /* __LINUX_BLK_TYPES_H */
--
1.7.10.rc2

2012-05-03 14:38:43

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v2 06/16] block: treat DMPG and SWAPIN requests as special

Venkatraman S <[email protected]> writes:

> From: Ilan Smith <[email protected]>
>
> When exp_swapin and exp_dmpg are set, treat read requests
> marked with DMPG and SWAPIN as high priority and move to
> the front of the queue.
>
[...]
> + if (bio_swapin(bio) && blk_queue_exp_swapin(q)) {
> + spin_lock_irq(q->queue_lock);
> + where = ELEVATOR_INSERT_FLUSH;
> + goto get_rq;
> + }
> +
> + if (bio_dmpg(bio) && blk_queue_exp_dmpg(q)) {
> + spin_lock_irq(q->queue_lock);
> + where = ELEVATOR_INSERT_FLUSH;
> + goto get_rq;

Is ELEVATOR_INSERT_FRONT not good enough? It seems wrong to use _FLUSH,
here. If the semantics of ELEVATOR_INSERT_FLUSH are really what is
required, then perhaps we need to have another think about the naming of
these flags.

Cheers,
Jeff

2012-05-03 16:22:57

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCH v2 06/16] block: treat DMPG and SWAPIN requests as special

On Thu, May 3, 2012 at 8:08 PM, Jeff Moyer <[email protected]> wrote:
> Venkatraman S <[email protected]> writes:
>
>> From: Ilan Smith <[email protected]>
>>
>> When exp_swapin and exp_dmpg are set, treat read requests
>> marked with DMPG and SWAPIN as high priority and move to
>> the front of the queue.
>>
> [...]
>> + if (bio_swapin(bio) && blk_queue_exp_swapin(q)) {
>> + spin_lock_irq(q->queue_lock);
>> + where = ELEVATOR_INSERT_FLUSH;
>> + goto get_rq;
>> + }
>> +
>> + if (bio_dmpg(bio) && blk_queue_exp_dmpg(q)) {
>> + spin_lock_irq(q->queue_lock);
>> + where = ELEVATOR_INSERT_FLUSH;
>> + goto get_rq;
>
> Is ELEVATOR_INSERT_FRONT not good enough? It seems wrong to use _FLUSH,
> here. If the semantics of ELEVATOR_INSERT_FLUSH are really what is
> required, then perhaps we need to have another think about the naming of
> these flags.
>
Actually - yes, ELEVATOR_INSERT_FRONT would do as well. In the
previous version of the MMC stack, we needed _FLUSH to trigger the
write operation that was to be preempted, to check that the
preemption actually works.


> Cheers,
> Jeff
>

2012-05-06 23:31:46

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Thu, May 03, 2012 at 07:53:00PM +0530, Venkatraman S wrote:
> From: Ilan Smith <[email protected]>
>
> Add attribute to identify demand paging requests.
> Mark readpages with demand paging attribute.
>
> Signed-off-by: Ilan Smith <[email protected]>
> Signed-off-by: Alex Lemberg <[email protected]>
> Signed-off-by: Venkatraman S <[email protected]>
> ---
> fs/mpage.c | 2 ++
> include/linux/bio.h | 7 +++++++
> include/linux/blk_types.h | 2 ++
> 3 files changed, 11 insertions(+)
>
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 0face1c..8b144f5 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -386,6 +386,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
> &last_block_in_bio, &map_bh,
> &first_logical_block,
> get_block);
> + if (bio)
> + bio->bi_rw |= REQ_RW_DMPG;

Have you thought about the potential for DOSing a machine
with this? That is, user data reads can now preempt writes of any
kind, effectively stalling writeback and memory reclaim which will
lead to OOM situations. Or, alternatively, journal flushing will get
stalled and no new modifications can take place until the read
stream stops.

This really seems like functionality that belongs in an IO
scheduler so that write starvation can be avoided, not in high-level
data read paths where we have no clue about anything else going on
in the IO subsystem....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-05-07 16:47:05

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

Mon, May 7, 2012 at 5:01 AM, Dave Chinner <[email protected]> wrote:
> On Thu, May 03, 2012 at 07:53:00PM +0530, Venkatraman S wrote:
>> From: Ilan Smith <[email protected]>
>>
>> Add attribute to identify demand paging requests.
>> Mark readpages with demand paging attribute.
>>
>> Signed-off-by: Ilan Smith <[email protected]>
>> Signed-off-by: Alex Lemberg <[email protected]>
>> Signed-off-by: Venkatraman S <[email protected]>
>> ---
>> fs/mpage.c | 2 ++
>> include/linux/bio.h | 7 +++++++
>> include/linux/blk_types.h | 2 ++
>> 3 files changed, 11 insertions(+)
>>
>> diff --git a/fs/mpage.c b/fs/mpage.c
>> index 0face1c..8b144f5 100644
>> --- a/fs/mpage.c
>> +++ b/fs/mpage.c
>> @@ -386,6 +386,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
>> &last_block_in_bio, &map_bh,
>> &first_logical_block,
>> get_block);
>> + if (bio)
>> + bio->bi_rw |= REQ_RW_DMPG;
>
> Have you thought about the potential for DOSing a machine
> with this? That is, user data reads can now preempt writes of any
> kind, effectively stalling writeback and memory reclaim which will
> lead to OOM situations. Or, alternatively, journal flushing will get
> stalled and no new modifications can take place until the read
> stream stops.

This feature doesn't fiddle with the I/O scheduler's ability to balance
read vs write requests or to handle requests from various process
queues (CFQ).

Also, for block devices which don't implement the ability to preempt
(and even for older MMC devices which don't implement this feature),
the behaviour falls back to waiting for write requests to complete
before issuing the read.

In low-end flash devices, some requests might take much longer than
normal due to background device maintenance (i.e. the flash erase /
reclaim procedure) kicking in in the context of an ongoing write,
stalling them by several orders of magnitude.

This implementation (see 14/16) does have several checks and timers
to ensure that it's not triggered very often. In my tests, where I
usually have a generous preemption time window, the abort happens
< 0.1% of the time.


>
> This really seems like functionality that belongs in an IO
> scheduler so that write starvation can be avoided, not in high-level
> data read paths where we have no clue about anything else going on
> in the IO subsystem....

Indeed, the feature is built mostly in the low-level device driver,
with minor changes in the elevator. Changes above the block layer are
only about setting attributes, and are transparent to their operation.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2012-05-08 07:46:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

On 05/03/2012 11:22 PM, Venkatraman S wrote:

> Standard eMMC (Embedded MultiMedia Card) specification expects to execute
> one request at a time. If some requests are more important than others, they
> can't be aborted while the flash procedure is in progress.
>
> New versions of the eMMC standard (4.41 and above) specfies a feature
> called High Priority Interrupt (HPI). This enables an ongoing transaction
> to be aborted using a special command (HPI command) so that the card is ready
> to receive new commands immediately. Then the new request can be submitted
> to the card, and optionally the interrupted command can be resumed again.
>
> Some restrictions exist on when and how the command can be used. For example,
> only write and write-like commands (ERASE) can be preempted, and the urgent
> request must be a read.
>
> In order to support this in software,
> a) At the top level, some policy decisions have to be made on what is
> worth preempting for.
> This implementation uses the demand paging requests and swap
> read requests as potential reads worth preempting an ongoing long write.
> This is expected to provide improved responsiveness for smarphones
> with multitasking capabilities - example would be launch a email application
> while a video capture session (which causes long writes) is ongoing.


Do you have numbers to prove it's really effective?

What concerns me is the low-memory situation.
There, write speed for page reclaim is important for responsiveness.
If we allow reads to preempt writes and writes are delayed, the read path takes longer to
get empty buffer pages in reclaim. In such a case, it couldn't be good.


--
Kind regards,
Minchan Kim

2012-05-08 16:31:32

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

On Tue, May 8, 2012 at 1:16 PM, Minchan Kim <[email protected]> wrote:
> On 05/03/2012 11:22 PM, Venkatraman S wrote:
>
>> Standard eMMC (Embedded MultiMedia Card) specification expects to execute
>> one request at a time. If some requests are more important than others, they
>> can't be aborted while the flash procedure is in progress.
>>
>> New versions of the eMMC standard (4.41 and above) specfies a feature
>> called High Priority Interrupt (HPI). This enables an ongoing transaction
>> to be aborted using a special command (HPI command) so that the card is ready
>> to receive new commands immediately. Then the new request can be submitted
>> to the card, and optionally the interrupted command can be resumed again.
>>
>> Some restrictions exist on when and how the command can be used. For example,
>> only write and write-like commands (ERASE) can be preempted, and the urgent
>> request must be a read.
>>
>> In order to support this in software,
>> a) At the top level, some policy decisions have to be made on what is
>> worth preempting for.
>> This implementation uses the demand paging requests and swap
>> read requests as potential reads worth preempting an ongoing long write.
>> This is expected to provide improved responsiveness for smarphones
>> with multitasking capabilities - example would be launch a email application
>> while a video capture session (which causes long writes) is ongoing.
>
>
> Do you have a number to prove it's really big effective?

What type of benchmarks would be appropriate to post?
As you know, the response time of a card varies depending on whether
the flash device has enough empty blocks to write into and doesn't
have to resort to GC during write requests.
Macro benchmarks like iozone would be inappropriate here, as they
won't show the latency effects of individual write requests hung up
doing page reclaim, which happens once in a while.

>
> What I have a concern is when we got low memory situation.
> Then, writing speed for page reclaim is important for response.
> If we allow read preempt write and write is delay, it means read path takes longer time to
> get a empty buffer pages in reclaim. In such case, it couldn't be good.
>

I agree. But when writes are delayed anyway because they exceed
hpi_time_threshold (the window available for invoking HPI), it means
that the device is in GC mode and either read or write could be
equally delayed. Note that even in the case of interrupting a write,
a single block write (which usually is too short to be interrupted,
as designed) is sufficient for doing a page reclaim, and further
write requests (including multiblock) would not be subject to HPI,
as they will complete within the average time.

2012-05-08 16:36:01

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Tue, May 8, 2012 at 11:58 AM, mani <[email protected]> wrote:
> How about adding the AS_DMPG flag in the file -> address_space when
> handling a filemap_fault(),
> so that we can treat page-fault pages as higher priority than
> normal read requests?
> How about changing the lines below to support pages that are
> requested via a page fault?
>
>
> --- a/fs/mpage.c	2012-05-04 12:59:12.000000000 +0530
> +++ b/fs/mpage.c	2012-05-07 13:13:49.000000000 +0530
> @@ -408,6 +408,8 @@ mpage_readpages(struct address_space *ma
>                     &last_block_in_bio, &map_bh,
>                     &first_logical_block,
>                     get_block);
> +           if (test_bit(AS_DMPG, &mapping->flags) && bio)
> +                   bio->bi_rw |= REQ_RW_DMPG;
>         }
>         page_cache_release(page);
>     }
> --- a/include/linux/pagemap.h	2012-05-04 12:57:35.000000000 +0530
> +++ b/include/linux/pagemap.h	2012-05-07 13:15:24.000000000 +0530
> @@ -27,6 +27,7 @@ enum mapping_flags {
>  #if defined (CONFIG_BD_CACHE_ENABLED)
>      AS_DIRECT  = __GFP_BITS_SHIFT + 4,  /* DIRECT_IO specified on file op */
>  #endif
> +    AS_DMPG    = __GFP_BITS_SHIFT + 5,  /* DEMAND PAGE specified on file op */
>  };
>
>  static inline void mapping_set_error(struct address_space *mapping, int error)
>
> --- a/mm/filemap.c	2012-05-04 12:58:49.000000000 +0530
> +++ b/mm/filemap.c	2012-05-07 13:15:03.000000000 +0530
> @@ -1646,6 +1646,7 @@ int filemap_fault(struct vm_area_struct
>      if (offset >= size)
>          return VM_FAULT_SIGBUS;
>
> +    set_bit(AS_DMPG, &file->f_mapping->flags);
>      /*
>       * Do we have something in the page cache already?
>       */
>
> Will these changes have any adverse effect?
>

Thanks for the example, but I can't judge which of the two is more
elegant or acceptable to maintainers.
I can test with your change and report whether it works.

> Thanks & Regards
> Manish
>
> On Mon, May 7, 2012 at 5:01 AM, Dave Chinner <[email protected]> wrote:
>>
>> On Thu, May 03, 2012 at 07:53:00PM +0530, Venkatraman S wrote:
>> > From: Ilan Smith <[email protected]>
>> >
>> > Add attribute to identify demand paging requests.
>> > Mark readpages with demand paging attribute.
>> >
>> > Signed-off-by: Ilan Smith <[email protected]>
>> > Signed-off-by: Alex Lemberg <[email protected]>
>> > Signed-off-by: Venkatraman S <[email protected]>
>> > ---
>> > fs/mpage.c | 2 ++
>> > include/linux/bio.h | 7 +++++++
>> > include/linux/blk_types.h | 2 ++
>> > 3 files changed, 11 insertions(+)
>> >
>> > diff --git a/fs/mpage.c b/fs/mpage.c
>> > index 0face1c..8b144f5 100644
>> > --- a/fs/mpage.c
>> > +++ b/fs/mpage.c
>> > @@ -386,6 +386,8 @@ mpage_readpages(struct address_space *mapping,
>> > struct list_head *pages,
>> > &last_block_in_bio, &map_bh,
>> > &first_logical_block,
>> > get_block);
>> > + if (bio)
>> > + bio->bi_rw |= REQ_RW_DMPG;
>>
>> Have you thought about the potential for DOSing a machine
>> with this? That is, user data reads can now preempt writes of any
>> kind, effectively stalling writeback and memory reclaim which will
>> lead to OOM situations. Or, alternatively, journal flushing will get
>> stalled and no new modifications can take place until the read
>> stream stops.
>>
>> This really seems like functionality that belongs in an IO
>> scheduler so that write starvation can be avoided, not in high-level
>> data read paths where we have no clue about anything else going on
>> in the IO subsystem....
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>> the body of a message to [email protected]
>> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>
>

2012-05-09 00:33:54

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Mon, May 07, 2012 at 10:16:30PM +0530, S, Venkatraman wrote:
> Mon, May 7, 2012 at 5:01 AM, Dave Chinner <[email protected]> wrote:
> > On Thu, May 03, 2012 at 07:53:00PM +0530, Venkatraman S wrote:
> >> From: Ilan Smith <[email protected]>
> >>
> >> Add attribute to identify demand paging requests.
> >> Mark readpages with demand paging attribute.
> >>
> >> Signed-off-by: Ilan Smith <[email protected]>
> >> Signed-off-by: Alex Lemberg <[email protected]>
> >> Signed-off-by: Venkatraman S <[email protected]>
> >> ---
> >> fs/mpage.c | 2 ++
> >> include/linux/bio.h | 7 +++++++
> >> include/linux/blk_types.h | 2 ++
> >> 3 files changed, 11 insertions(+)
> >>
> >> diff --git a/fs/mpage.c b/fs/mpage.c
> >> index 0face1c..8b144f5 100644
> >> --- a/fs/mpage.c
> >> +++ b/fs/mpage.c
> >> @@ -386,6 +386,8 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
> >> &last_block_in_bio, &map_bh,
> >> &first_logical_block,
> >> get_block);
> >> + if (bio)
> >> + bio->bi_rw |= REQ_RW_DMPG;
> >
> > Have you thought about the potential for DOSing a machine
> > with this? That is, user data reads can now preempt writes of any
> > kind, effectively stalling writeback and memory reclaim which will
> > lead to OOM situations. Or, alternatively, journal flushing will get
> > stalled and no new modifications can take place until the read
> > stream stops.
>
> This feature doesn't fiddle with the I/O scheduler's ability to balance
> read vs write requests or handling requests from various process queues (CFQ).

And for schedulers like no-op that don't do any read/write balancing?
Also, I thought the code was queuing such demand paged requests at
the front of the queues, too, so bypassing most of the read/write
balancing logic of the elevators...

> Also, for block devices which don't implement the ability to preempt (and even
> for older versions of MMC devices which don't implement this feature),
> the behaviour
> falls back to waiting for write requests to complete before issuing the read.

Sure, but my point is that you are adding a flag that will be set
for all user data read IO, and then making it priviledged in the
lower layers.

> In low end flash devices, some requests might take too long than normal
> due to background device maintenance (i.e flash erase / reclaim procedure)
> kicking in in the context of an ongoing write, stalling them by several
> orders of magnitude.

And thereby stalling what might be writes critical to operation.
Indeed, how does this affect the system when it starts swapping
heavily? If you keep stalling writes, the system won't be able to
swap and free memory...

> This implementation (See 14/16) does have several checks and
> timers to see that it's not triggered very often. In my tests,
> where I usually have a generous preemption time window, the abort
> happens < 0.1% of the time.

Yes, but seeing as the user has direct control of the pre-emption
vector, it's not hard to imagine someone using it for a timing
attack...

> > This really seems like functionality that belongs in an IO
> > scheduler so that write starvation can be avoided, not in high-level
> > data read paths where we have no clue about anything else going on
> > in the IO subsystem....
>
> Indeed, the feature is built mostly in the low level device driver and
> minor changes in the elevator. Changes above the block layer are only
> about setting
> attributes and transparent to their operation.

The problem is that the attribute you are setting covers every
single data read that is done by all users. If that's what you want
to have happen, then why do you even need a new flag at this layer?
Just treat every non-REQ_META read request as a demand paged IO and
you've got exactly the same behaviour without needing to tag at the
higher layer....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-05-09 00:45:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

On 05/09/2012 01:31 AM, S, Venkatraman wrote:

> On Tue, May 8, 2012 at 1:16 PM, Minchan Kim <[email protected]> wrote:
>> On 05/03/2012 11:22 PM, Venkatraman S wrote:
>>
>>> Standard eMMC (Embedded MultiMedia Card) specification expects to execute
>>> one request at a time. If some requests are more important than others, they
>>> can't be aborted while the flash procedure is in progress.
>>>
>>> New versions of the eMMC standard (4.41 and above) specfies a feature
>>> called High Priority Interrupt (HPI). This enables an ongoing transaction
>>> to be aborted using a special command (HPI command) so that the card is ready
>>> to receive new commands immediately. Then the new request can be submitted
>>> to the card, and optionally the interrupted command can be resumed again.
>>>
>>> Some restrictions exist on when and how the command can be used. For example,
>>> only write and write-like commands (ERASE) can be preempted, and the urgent
>>> request must be a read.
>>>
>>> In order to support this in software,
>>> a) At the top level, some policy decisions have to be made on what is
>>> worth preempting for.
>>> This implementation uses the demand paging requests and swap
>>> read requests as potential reads worth preempting an ongoing long write.
>>> This is expected to provide improved responsiveness for smarphones
>>> with multitasking capabilities - example would be launch a email application
>>> while a video capture session (which causes long writes) is ongoing.
>>
>>
>> Do you have a number to prove it's really big effective?
>
> What type of benchmarks would be appropriate to post ?
> As you know, the response time of a card would vary depending on
> whether the flash device
> has enough empty blocks to write into and doesn't have to resort to GC during
> write requests.
> Macro benchmarks like iozone etc would be inappropriate here, as they won't show
> the latency effects of individual write requests hung up doing page
> reclaim, which happens
> once in a while.


We don't have such a special benchmark, so you need to take time to think about how to prove it.
IMHO, you could use run-many-x-apps.sh, which checks the elapsed time to launch programs,
posted by Wu a long time ago.

http://www.spinics.net/lists/linux-mm/msg09653.html

Of course, your eMMC should be 80~90% full to trigger GC stress, and
your memory should be used up by dirty pages so that reclaim happens.


>>
>> What I have a concern is when we got low memory situation.
>> Then, writing speed for page reclaim is important for response.
>> If we allow read preempt write and write is delay, it means read path takes longer time to
>> get a empty buffer pages in reclaim. In such case, it couldn't be good.
>>
>
> I agree. But when writes are delayed anyway as they exceed hpi_time_threshold
> (the window available for invoking HPI), it means that the device is in GC mode
> and either reads or writes could be equally delayed. Note that even in case of
> interrupting a write, a single block write (which usually is too short to be
> interrupted, as designed) is sufficient for doing a page reclaim, and further
> write requests (including multiblock) would not be subject to HPI, as they will
> complete within the average time.


My point is that it would be better for reads not to preempt write-for-page_reclaim.
And we can identify it by PG_reclaim. You get the idea.

Anyway, HPI is a feature of only one device among many storage types, and you are requiring
modifications to generic layers, although they're not big. So to get justification and attract many
core guys (MM, FS, BLOCK), you should provide data at least.


> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



--
Kind regards,
Minchan Kim

2012-05-09 08:35:08

by Konstantin Dorfman

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] mmc: block: Implement HPI invocation and handling logic.


> +static bool mmc_can_do_foreground_hpi(struct mmc_queue *mq,
> + struct request *req, unsigned int thpi)
> +{
> +
> + /*
> + * If some time has elapsed since the issuing of previous write
> + * command, or if the size of the request was too small, there's
> + * no point in preempting it. Check if it's worthwhile to preempt
> + */
> + int time_elapsed = jiffies_to_msecs(jiffies -
> + mq->mqrq_cur->mmc_active.mrq->cmd->started_time);
> +
> + if (time_elapsed <= thpi)
> + return true;
Some host controllers (or DMA engines) have the ability to get the byte count of the
current transaction. It could be implemented as a host API (similar to the abort
ops). Then you would have a more accurate estimation of worthiness.

> +
> + return false;
> +}

Thanks, Kostya

2012-05-09 14:00:16

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Wednesday 09 May 2012, Dave Chinner wrote:
> > In low-end flash devices, some requests might take much longer than normal
> > due to background device maintenance (i.e. flash erase / reclaim procedures)
> > kicking in in the context of an ongoing write, stalling them by several
> > orders of magnitude.
>
> And thereby stalling what might be writes critical to operation.
> Indeed, how does this affect the system when it starts swapping
> heavily? If you keep stalling writes, the system won't be able to
> swap and free memory...

The point here is that reads have a consistent latency, e.g. 500
microseconds for a small access, while writes have a latency
that can easily become 1000x the read latency (e.g. 500 ms of
blocking the device) depending on the state of the device. Most
of the time, writes are fast as well, but sometimes (when garbage
collection happens in the device), they are extremely slow and
block everything else.
This is the only time we ever want to interrupt a write: keeping
the system running interactively while eventually getting to do
the writeback. There is a small penalty for interrupting the garbage
collection, but the device should be able to pick up its work
at the point where we interrupt it, so we can still make forward
progress.

> > > This really seems like functionality that belongs in an IO
> > > scheduler so that write starvation can be avoided, not in high-level
> > > data read paths where we have no clue about anything else going on
> > > in the IO subsystem....
> >
> > Indeed, the feature is built mostly in the low level device driver, with
> > minor changes in the elevator. Changes above the block layer are only
> > about setting attributes and are transparent to their operation.
>
> The problem is that the attribute you are setting covers every
> single data read that is done by all users. If that's what you want
> to have happen, then why do you even need a new flag at this layer?
> Just treat every non-REQ_META read request as a demand paged IO and
> you've got exactly the same behaviour without needing to tag at the
> higher layer....

My feeling is that we should just treat every (REQ_SYNC | REQ_READ)
request the same and let them interrupt long-running writes,
independent of whether it's REQ_META or demand paging.

Arnd

2012-05-09 14:01:25

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] mmc: block: Implement HPI invocation and handling logic.

On Wednesday 09 May 2012, [email protected] wrote:
> > +static bool mmc_can_do_foreground_hpi(struct mmc_queue *mq,
> > + struct request *req, unsigned int thpi)
> > +{
> > +
> > + /*
> > + * If some time has elapsed since the issuing of previous write
> > + * command, or if the size of the request was too small, there's
> > + * no point in preempting it. Check if it's worthwhile to preempt
> > + */
> > + int time_elapsed = jiffies_to_msecs(jiffies -
> > + mq->mqrq_cur->mmc_active.mrq->cmd->started_time);
> > +
> > + if (time_elapsed <= thpi)
> > + return true;
> Some host controllers (or DMA engines) have the ability to get the byte count of the
> current transaction. It could be implemented as a host API (similar to the abort
> ops). Then you would have a more accurate estimation of worthiness.

I'm rather sure that the byte count is not relevant here: it's not
the actual write that is taking so long, it's the garbage collection
that the device does internally before the write actually gets done.
The data transfer is much faster than the time we are waiting for here.

Arnd

2012-05-09 14:07:23

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] mmc: block: Implement HPI invocation and handling logic.

On Wed, May 9, 2012 at 2:05 PM, <[email protected]> wrote:
>
>> +static bool mmc_can_do_foreground_hpi(struct mmc_queue *mq,
>> +			struct request *req, unsigned int thpi)
>> +{
>> +
>> +	/*
>> +	 * If some time has elapsed since the issuing of previous write
>> +	 * command, or if the size of the request was too small, there's
>> +	 * no point in preempting it. Check if it's worthwhile to preempt
>> +	 */
>> +	int time_elapsed = jiffies_to_msecs(jiffies -
>> +			mq->mqrq_cur->mmc_active.mrq->cmd->started_time);
>> +
>> +	if (time_elapsed <= thpi)
>> +		return true;
> Some host controllers (or DMA engines) have the ability to get the byte count of the
> current transaction. It could be implemented as a host API (similar to the abort
> ops). Then you would have a more accurate estimation of worthiness.
>

The byte count returned by the DMA or the HC doesn't mean that the data has
actually been burnt into the device (due to internal buffering). This is one
of the reasons for defining the CORRECTLY_PRG_SECTORS_NUM register in the
standard, which can be queried to find how much was correctly written.
Unfortunately it can only be queried after the abort has been issued.

>> +
>> +	return false;
>> +}
>
> Thanks, Kostya
>

2012-05-09 15:03:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Wed, May 09, 2012 at 01:59:40PM +0000, Arnd Bergmann wrote:
> My feeling is that we should just treat every (REQ_SYNC | REQ_READ)
> request the same and let them interrupt long-running writes,
> independent of whether it's REQ_META or demand paging.

It's funny that the CFQ scheduler used to boost metadata reads that
have REQ_META set - in fact it still does for those filesystems using
the now split out REQ_PRIO.

2012-05-09 15:28:29

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Mon, May 07, 2012 at 10:16:30PM +0530, S, Venkatraman wrote:

[..]
> This feature doesn't fiddle with the I/O scheduler's ability to balance
> read vs write requests or handling requests from various process queues (CFQ).
>

Does this feature work with CFQ? As CFQ does not submit sync IO (for
idling queues) while async IO is pending and vice versa (cfq_may_dispatch()).

Thanks
Vivek

2012-05-09 16:55:14

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v2 01/16] FS: Added demand paging markers to filesystem

On Wednesday 09 May 2012, Christoph Hellwig wrote:
> On Wed, May 09, 2012 at 01:59:40PM +0000, Arnd Bergmann wrote:
> > My feeling is that we should just treat every (REQ_SYNC | REQ_READ)
> > request the same and let them interrupt long-running writes,
> > independent of whether it's REQ_META or demand paging.
>
> It's funny that the CFQ scheduler used to boost metadata reads that
> have REQ_META set - in fact it still does for those filesystems using
> the now split out REQ_PRIO.

That certainly sounds more sensible than the opposite.

Of course, this is somewhat unrelated to the question of prioritizing
reads over any writes that are already started. IMHO it would be
pointless to only stop the write in order to do a REQ_PRIO read but
not any other read.

Arnd

2012-05-11 19:19:26

by Venkatraman S

[permalink] [raw]
Subject: Re: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

On Wed, May 9, 2012 at 6:15 AM, Minchan Kim <[email protected]> wrote:
> On 05/09/2012 01:31 AM, S, Venkatraman wrote:
>
>> On Tue, May 8, 2012 at 1:16 PM, Minchan Kim <[email protected]> wrote:
>>> On 05/03/2012 11:22 PM, Venkatraman S wrote:
>>>
>>>> Standard eMMC (Embedded MultiMedia Card) specification expects to execute
>>>> one request at a time. If some requests are more important than others, they
>>>> can't be aborted while the flash procedure is in progress.
>>>>
>>>> New versions of the eMMC standard (4.41 and above) specify a feature
>>>> called High Priority Interrupt (HPI). This enables an ongoing transaction
>>>> to be aborted using a special command (HPI command) so that the card is ready
>>>> to receive new commands immediately. Then the new request can be submitted
>>>> to the card, and optionally the interrupted command can be resumed again.
>>>>
>>>> Some restrictions exist on when and how the command can be used. For example,
>>>> only write and write-like commands (ERASE) can be preempted, and the urgent
>>>> request must be a read.
>>>>
>>>> In order to support this in software,
>>>> a) At the top level, some policy decisions have to be made on what is
>>>> worth preempting for.
>>>>      This implementation uses the demand paging requests and swap
>>>> read requests as potential reads worth preempting an ongoing long write.
>>>>      This is expected to provide improved responsiveness for smartphones
>>>> with multitasking capabilities - an example would be launching an email application
>>>> while a video capture session (which causes long writes) is ongoing.
>>>
>>>
>>> Do you have numbers to prove it's really effective?
>>
>> What type of benchmarks would be appropriate to post ?
>> As you know, the response time of a card would vary depending on
>> whether the flash device
>> has enough empty blocks to write into and doesn't have to resort to GC during
>> write requests.
>> Macro benchmarks like iozone etc would be inappropriate here, as they won't show
>> the latency effects of individual write requests hung up doing page
>> reclaim, which happens
>> once in a while.
>
>
> We don't have such a specialized benchmark, so you'll need to think about how to prove it.
> IMHO, you could use run-many-x-apps.sh, which checks the elapsed time to activate programs;
> it was posted by Wu a long time ago.
>
> http://www.spinics.net/lists/linux-mm/msg09653.html
>
> Of course, your eMMC should be filled above 80~90% to trigger GC stress, and
> your memory should be used up by dirty pages so that reclaim happens.
>
>
>>>
>>> What concerns me is the low memory situation.
>>> Then, write speed for page reclaim is important for responsiveness.
>>> If we allow reads to preempt writes and writes are delayed, it means the read path takes longer to
>>> get empty buffer pages in reclaim. In such a case, it wouldn't be good.
>>>
>>
>> I agree. But when writes are delayed anyway as they exceed hpi_time_threshold
>> (the window available for invoking HPI), it means that the device is in GC mode
>> and either reads or writes could be equally delayed. Note that even in case of
>> interrupting a write, a single block write (which usually is too short to be
>> interrupted, as designed) is sufficient for doing a page reclaim, and further
>> write requests (including multiblock) would not be subject to HPI, as they will
>> complete within the average time.
>
>
> My point is that it would be better for reads not to preempt write-for-page_reclaim.
> And we can identify it by PG_reclaim. You get the idea.
>
> Anyway, HPI is a feature of only one device among many storage types, and you are requiring
> modifications to generic layers, although they're not big. So to get justification and attract many
> core guys (MM, FS, BLOCK), you should provide data at least.
>
Hi Kim,
Apologies for the delayed response. I am studying your suggestions and will
get back with some changes and also some profiling data.
Regards,
Venkat.

2012-05-14 07:55:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2 00/16] [FS, MM, block, MMC]: eMMC High Priority Interrupt Feature

On 05/14/2012 04:43 PM, mani wrote:

> Dear Kim,
>
> I have a query here.
>
>
> My point is that it would be better for read to not preempt
> write-for-page_reclaim.
> And we can identify it by PG_reclaim. You can get the idea.
>
> I think if there is no page available then no read will proceed.
> When a read request comes, it reclaims pages (starting writeback for
> syncable pages) and returns after reclaiming the pages.
> Only then will the read request reach the MMC subsystem.
> And I think the reclaim algorithm will reclaim a substantial amount
> of pages at a time instead of a single page.
> So if we get a few pages during reclamation, will there be any
> problem in halting another write op to proceed with the reads?
>
> Can we think of a scenario where we are reclaiming pages and a write
> op is going on, while a high priority read for the interrupt handler
> is pending?
>
> Please correct me if I am wrong.


For example, a system can have lots of order-0 pages but few higher-order pages.
In this case, to get a big contiguous chunk of memory, the reclaimer has to write out
dirty pages, even while it can still handle order-0 page read requests.


>
> Thanks & Regards
> Manish



--
Kind regards,
Minchan Kim