2019-01-18 10:34:17

by Andrea Righi

Subject: [RFC PATCH 0/3] cgroup: fsio throttle controller

This is a redesign of my old cgroup-io-throttle controller:
https://lwn.net/Articles/330531/

I'm resuming this old patch to point out a problem that I think is still
not solved completely.

= Problem =

The io.max controller works really well at limiting synchronous I/O
(READs), but a lot of I/O requests are initiated outside the context of
the process that is ultimately responsible for their creation (e.g.,
WRITEs).

Throttling at the block layer in some cases is too late and we may end
up slowing down processes that are not responsible for the I/O that
is being processed at that level.

= Proposed solution =

The main idea of this controller is to split I/O measurement and I/O
throttling: I/O is measured at the block layer for READs and at the page
cache (dirty pages) level for WRITEs, and processes are throttled while
they're generating I/O at the VFS level, based on the measured I/O.
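
Roughly speaking, throttling is implemented as a per-cgroup, per-device
token bucket (see the documentation patch for the details). The snippet
below is only a simplified sketch of the idea, with made-up names; the
real code lives in kernel/cgroup/fsio-throttle.c:

struct fsio_policy {
	u64 rate_bps;		/* <mb_per_second> converted to bytes/s */
	u64 bucket_size;	/* <bucket_size_in_mb> converted to bytes */
	u64 tokens;		/* tokens (bytes) currently in the bucket */
	u64 last_refill;	/* timestamp of the last refill (ns) */
};

/* charge @bytes to @p and return how long the caller should sleep (ns) */
static u64 fsio_account(struct fsio_policy *p, u64 bytes)
{
	u64 now = ktime_get_ns(), sleep = 0;

	/* refill the bucket at rate_bps, capped to the bucket size */
	p->tokens += div_u64((now - p->last_refill) * p->rate_bps,
			     NSEC_PER_SEC);
	p->tokens = min(p->tokens, p->bucket_size);
	p->last_refill = now;

	/* not enough tokens: the task has to sleep for the missing part */
	if (bytes > p->tokens)
		sleep = div64_u64((bytes - p->tokens) * NSEC_PER_SEC,
				  p->rate_bps);
	p->tokens -= min(bytes, p->tokens);

	return sleep;
}

The hooks added by patch 3/3 then call fsio_throttle(dev, bytes, state),
which charges the I/O to the current cgroup's bucket for that device and
puts the task to sleep when the budget is exhausted (with state == 0 the
I/O is only accounted and the required sleep is returned to the caller).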

= Example =

Here's a trivial example: create 2 cgroups, set an io.max limit of
10MB/s, run a write-intensive workload on both and, after a while, from
the root cgroup, run "sync".

# cat /proc/self/cgroup
0::/cg1
# fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

# cat /proc/self/cgroup
0::/cg2
# fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

- io.max controller:

# echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
# echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max

# cat /proc/self/cgroup
0::/
# time sync

real 0m51,241s
user 0m0,000s
sys 0m0,113s

Ideally "sync" should complete almost immediately, because the root
cgroup is unlimited and it's not doing any I/O at all, but instead it's
blocked for more than 50 sec with io.max, because the writeback is
throttled to satisfy the io.max limits.

- fsio controller:

# echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
# echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs

[you can find details about the syntax in the documentation patch]

# cat /proc/self/cgroup
0::/
# time sync

real 0m0,146s
user 0m0,003s
sys 0m0,001s

= Questions =

Q: Do we need another controller?
A: Probably not. I think it would be better to integrate this policy (or
something similar) in the current blkio controller; this series is just
meant to highlight the problem and get some ideas on how to address it.

Q: What about proportional limits / latency?
A: It should be trivial to add latency-based limits if we integrate this in the
current I/O controller. As for proportional limits (weights), they're
strictly related to I/O scheduling, and since this controller doesn't touch
I/O dispatching policies it's not trivial to implement them
(bandwidth limiting is definitely more straightforward).

Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
writeback, right?
A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
avoid priority inversion problems in the system.

Andrea Righi (3):
fsio-throttle: documentation
fsio-throttle: controller infrastructure
fsio-throttle: instrumentation

Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
block/blk-core.c | 10 +
include/linux/cgroup_subsys.h | 4 +
include/linux/fsio-throttle.h | 43 +++
include/linux/writeback.h | 7 +-
init/Kconfig | 11 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/fsio-throttle.c | 501 ++++++++++++++++++++++++++++++
mm/filemap.c | 20 +-
mm/page-writeback.c | 14 +-
10 files changed, 749 insertions(+), 4 deletions(-)



2019-01-18 10:34:46

by Andrea Righi

Subject: [RFC PATCH 1/3] fsio-throttle: documentation

Document the filesystem I/O controller: description, usage, design,
etc.

Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/cgroup-v1/fsio-throttle.txt | 142 ++++++++++++++++++++++
1 file changed, 142 insertions(+)
create mode 100644 Documentation/cgroup-v1/fsio-throttle.txt

diff --git a/Documentation/cgroup-v1/fsio-throttle.txt b/Documentation/cgroup-v1/fsio-throttle.txt
new file mode 100644
index 000000000000..4f33cae2adea
--- /dev/null
+++ b/Documentation/cgroup-v1/fsio-throttle.txt
@@ -0,0 +1,142 @@
+
+ Filesystem I/O throttling controller
+
+----------------------------------------------------------------------
+1. OVERVIEW
+
+This controller allows limiting the filesystem I/O of mounted devices for
+specific process containers (cgroups [1]), enforcing delays on the processes
+that exceed the limits defined for their cgroup.
+
+The goal of the filesystem I/O controller is to improve performance
+predictability from the applications' point of view and to provide performance
+isolation among different control groups sharing the same filesystems.
+
+----------------------------------------------------------------------
+2. DESIGN
+
+I/O activity generated by READs is evaluated at the block layer; WRITEs are
+evaluated when a page changes from clean to dirty (rewriting a page that was
+already dirty doesn't generate extra I/O activity).
+
+Throttling is always performed at the VFS layer.
+
+This solution has the advantage of always being able to determine the
+task/cgroup that originally generated the I/O request, and it prevents
+filesystem locking contention and potential priority inversion problems
+(for example, throttled journal I/O that may slow down the entire system).
+
+The downside of this solution is that the controller is fuzzier (compared to
+the blkio controller) and it allows I/O bursts that may happen at the I/O
+scheduler layer.
+
+----------------------------------------------------------------------
+2.1. TOKEN BUCKET THROTTLING
+
+ Tokens (I/O rate) - <mb_per_second>
+ o
+ o
+ o
+ ....... <--.
+ \ / | Bucket size (burst limit)
+ \ooo/ | <bucket_size_in_mb>
+ --- <--'
+ |ooo
+ Incoming --->|---> Conforming
+ I/O |oo I/O
+ requests -->|--> requests
+ |
+ ---->|
+
+Token bucket [2] throttling: <mb_per_second> tokens are added to the bucket
+every second; the bucket can hold at most <bucket_size_in_mb> tokens; I/O
+requests are accepted if there are available tokens in the bucket; when a
+request of N bytes arrives, N tokens are removed from the bucket; if fewer than
+N tokens are available in the bucket, the request is delayed until a sufficient
+amount of tokens is available again in the bucket.
+
+----------------------------------------------------------------------
+3. USER INTERFACE
+
+A new I/O limit (in MB/s) can be defined using the file:
+- fsio.max_mbs
+
+The syntax of a throttling policy is the following:
+
+"<major>:<minor> <mb_per_second> <bucket_size_in_mb>"
+
+Examples:
+
+- set a maximum I/O rate of 10MB/s on /dev/sda (8:0), bucket size = 10MB:
+
+ # echo "8:0 10 10" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+- remove the I/O limit defined for /dev/sda (8:0):
+
+ # echo "8:0 0 0" > /sys/fs/cgroup/cg1/fsio.max_mbs
+
+----------------------------------------------------------------------
+4. Additional parameters
+
+----------------------------------------------------------------------
+4.1. Sleep timeslice
+
+The sleep timeslice is a configurable parameter that defines the minimum sleep
+time enforced on throttled tasks. Tasks will never be put to sleep for less
+than the sleep timeslice. Moreover, wakeup timers are always aligned to a
+multiple of the sleep timeslice.
+
+Increasing the sleep timeslice has the advantage of reducing the overhead of
+the controller: with coarser-grained control, fewer timers are created to
+wake up tasks, which means less softirq pressure in the system and less
+overhead introduced. However, a bigger sleep timeslice makes the controller
+fuzzier, since throttled tasks receive fewer throttling events with larger
+sleeps.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+The default value is 250 ms.
+
+Example:
+ - set the sleep timeslice to 1s:
+
+ # echo 1000 > /sys/module/fsio_throttle/parameters/throttle_timeslice_ms
+
+----------------------------------------------------------------------
+4.2. Sleep timeframe
+
+This parameter defines the maximum time a throttled task can sleep.
+
+The parameter can be changed via:
+/sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+The default value is 2 sec.
+
+Example:
+ - set the sleep timeframe to 5s:
+
+ # echo 5000 > /sys/module/fsio_throttle/parameters/throttle_timeframe_ms
+
+4.3. Throttle kernel threads
+
+By default, kernel threads are never throttled or accounted for any I/O
+activity. It is possible to change this behavior by writing 1 to:
+
+/sys/module/fsio_throttle/parameters/throttle_kernel_threads
+
+It is strongly recommended to not change this setting unless you know what you
+are doing.
+
+----------------------------------------------------------------------
+5. TODO
+
+- Integration with the blkio controller
+- Provide distinct read/write limits, as well as bandwidth (MB/s) vs IOPS limits
+- Provide additional statistics in cgroupfs
+
+----------------------------------------------------------------------
+6. REFERENCES
+
+[1] Documentation/admin-guide/cgroup-v2.rst
+[2] http://en.wikipedia.org/wiki/Token_bucket
--
2.17.1


2019-01-18 10:35:07

by Andrea Righi

Subject: [RFC PATCH 3/3] fsio-throttle: instrumentation

Hook the fsio controller into the relevant kernel functions to evaluate
and throttle filesystem I/O.

Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-core.c | 10 ++++++++++
include/linux/writeback.h | 7 ++++++-
mm/filemap.c | 20 +++++++++++++++++++-
mm/page-writeback.c | 14 ++++++++++++--
4 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c5f61ceeb67..4b4717f64ac1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -16,6 +16,7 @@
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/fsio-throttle.h>
#include <linux/blk-mq.h>
#include <linux/highmem.h>
#include <linux/mm.h>
@@ -956,6 +957,15 @@ generic_make_request_checks(struct bio *bio)
*/
create_io_context(GFP_ATOMIC, q->node);

+ /*
+ * Account only READs at this layer (WRITEs are accounted and throttled
+ * in balance_dirty_pages()) and don't enforce sleeps (state=0): in this
+ * way we can prevent potential lock contention and priority inversion
+ * problems at the filesystem layer.
+ */
+ if (bio_op(bio) == REQ_OP_READ)
+ fsio_throttle(bio_dev(bio), bio->bi_iter.bi_size, 0);
+
if (!blkcg_bio_issue_check(q, bio))
return false;

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..1e161c7969e5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -356,7 +356,12 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);

void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
-void balance_dirty_pages_ratelimited(struct address_space *mapping);
+
+#define balance_dirty_pages_ratelimited(__mapping) \
+ __balance_dirty_pages_ratelimited(__mapping, false)
+void __balance_dirty_pages_ratelimited(struct address_space *mapping,
+ bool redirty);
+
bool wb_over_bg_thresh(struct bdi_writeback *wb);

typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..5cc0959274d6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -29,6 +29,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/blkdev.h>
+#include <linux/fsio-throttle.h>
#include <linux/security.h>
#include <linux/cpuset.h>
#include <linux/hugetlb.h>
@@ -2040,6 +2041,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
{
struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
+ struct block_device *bdev = as_to_bdev(mapping);
struct inode *inode = mapping->host;
struct file_ra_state *ra = &filp->f_ra;
loff_t *ppos = &iocb->ki_pos;
@@ -2068,6 +2070,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,

cond_resched();
find_page:
+ fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
if (fatal_signal_pending(current)) {
error = -EINTR;
goto out;
@@ -2308,11 +2311,17 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
if (iocb->ki_flags & IOCB_DIRECT) {
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
+ struct block_device *bdev = as_to_bdev(mapping);
struct inode *inode = mapping->host;
loff_t size;

size = i_size_read(inode);
if (iocb->ki_flags & IOCB_NOWAIT) {
+ unsigned long long sleep;
+
+ sleep = fsio_throttle(bdev_to_dev(bdev), 0, 0);
+ if (sleep)
+ return -EAGAIN;
if (filemap_range_has_page(mapping, iocb->ki_pos,
iocb->ki_pos + count - 1))
return -EAGAIN;
@@ -2322,6 +2331,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
iocb->ki_pos + count - 1);
if (retval < 0)
goto out;
+ fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
}

file_accessed(file);
@@ -2366,9 +2376,11 @@ EXPORT_SYMBOL(generic_file_read_iter);
static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
{
struct address_space *mapping = file->f_mapping;
+ struct block_device *bdev = as_to_bdev(mapping);
struct page *page;
int ret;

+ fsio_throttle(bdev_to_dev(bdev), 0, TASK_INTERRUPTIBLE);
do {
page = __page_cache_alloc(gfp_mask);
if (!page)
@@ -2498,11 +2510,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
*/
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+ struct block_device *bdev = as_to_bdev(mapping);
/*
* We found the page, so try async readahead before
* waiting for the lock.
*/
do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+ if (unlikely(!PageUptodate(page)))
+ fsio_throttle(bdev_to_dev(bdev), 0,
+ TASK_INTERRUPTIBLE);
} else if (!page) {
/* No page in the page cache at all */
do_sync_mmap_readahead(vmf->vma, ra, file, offset);
@@ -3172,6 +3188,7 @@ ssize_t generic_perform_write(struct file *file,
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
+ unsigned int dirty;

do {
struct page *page;
@@ -3216,6 +3233,7 @@ ssize_t generic_perform_write(struct file *file,
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
flush_dcache_page(page);

+ dirty = PageDirty(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
if (unlikely(status < 0))
@@ -3241,7 +3259,7 @@ ssize_t generic_perform_write(struct file *file,
pos += copied;
written += copied;

- balance_dirty_pages_ratelimited(mapping);
+ __balance_dirty_pages_ratelimited(mapping, dirty);
} while (iov_iter_count(i));

return written ? written : status;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7d1010453fb9..694ede8783f3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
+#include <linux/fsio-throttle.h>
#include <linux/init.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>
@@ -1858,10 +1859,12 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
* limit we decrease the ratelimiting by a lot, to prevent individual processes
* from overshooting the limit by (ratelimit_pages) each.
*/
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void __balance_dirty_pages_ratelimited(struct address_space *mapping,
+ bool redirty)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct block_device *bdev = as_to_bdev(mapping);
struct bdi_writeback *wb = NULL;
int ratelimit;
int *p;
@@ -1878,6 +1881,13 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
if (wb->dirty_exceeded)
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

+ /*
+ * Throttle filesystem I/O only if page was initially clean: re-writing
+ * a dirty page doesn't generate additional I/O.
+ */
+ if (!redirty)
+ fsio_throttle(bdev_to_dev(bdev), PAGE_SIZE, TASK_KILLABLE);
+
preempt_disable();
/*
* This prevents one CPU to accumulate too many dirtied pages without
@@ -1911,7 +1921,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)

wb_put(wb);
}
-EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
+EXPORT_SYMBOL(__balance_dirty_pages_ratelimited);

/**
* wb_over_bg_thresh - does @wb need to be written back?
--
2.17.1


2019-01-18 11:06:15

by Paolo Valente

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller



> On 18 Jan 2019, at 11:31, Andrea Righi <[email protected]> wrote:
>
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
>
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
>
> = Problem =
>
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
>
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.
>
> = Proposed solution =
>
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
>

Hi Andrea,
what about the case where two processes are dirtying the same
pages? Which one will be charged?

Thanks,
Paolo

> = Example =
>
> Here's a trivial example: create 2 cgroups, set an io.max limit of
> 10MB/s, run a write-intensive workload on both and after a while, from a
> root cgroup, run "sync".
>
> # cat /proc/self/cgroup
> 0::/cg1
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
>
> # cat /proc/self/cgroup
> 0::/cg2
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
>
> - io.max controller:
>
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max
>
> # cat /proc/self/cgroup
> 0::/
> # time sync
>
> real 0m51,241s
> user 0m0,000s
> sys 0m0,113s
>
> Ideally "sync" should complete almost immediately, because the root
> cgroup is unlimited and it's not doing any I/O at all, but instead it's
> blocked for more than 50 sec with io.max, because the writeback is
> throttled to satisfy the io.max limits.
>
> - fsio controller:
>
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs
>
> [you can find details about the syntax in the documentation patch]
>
> # cat /proc/self/cgroup
> 0::/
> # time sync
>
> real 0m0,146s
> user 0m0,003s
> sys 0m0,001s
>
> = Questions =
>
> Q: Do we need another controller?
> A: Probably no, I think it would be better to integrate this policy (or
> something similar) in the current blkio controller, this is just to
> highlight the problem and get some ideas on how to address it.
>
> Q: What about proportional limits / latency?
> A: It should be trivial to add latency-based limits if we integrate this in the
> current I/O controller. About proportional limits (weights), they're
> strictly related to I/O scheduling and since this controller doesn't touch
> I/O dispatching policies it's not trivial to implement proportional limits
> (bandwidth limiting is definitely more straightforward).
>
> Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
> writeback, right?
> A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
> avoid priority inversion problems in the system.
>
> Andrea Righi (3):
> fsio-throttle: documentation
> fsio-throttle: controller infrastructure
> fsio-throttle: instrumentation
>
> Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
> block/blk-core.c | 10 +
> include/linux/cgroup_subsys.h | 4 +
> include/linux/fsio-throttle.h | 43 +++
> include/linux/writeback.h | 7 +-
> init/Kconfig | 11 +
> kernel/cgroup/Makefile | 1 +
> kernel/cgroup/fsio-throttle.c | 501 ++++++++++++++++++++++++++++++
> mm/filemap.c | 20 +-
> mm/page-writeback.c | 14 +-
> 10 files changed, 749 insertions(+), 4 deletions(-)
>


2019-01-18 11:11:34

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>
>
> > On 18 Jan 2019, at 11:31, Andrea Righi <[email protected]> wrote:
> >
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> >
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> >
> > = Problem =
> >
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> >
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> >
> > = Proposed solution =
> >
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> >
>
> Hi Andrea,
> what the about the case where two processes are dirtying the same
> pages? Which will be charged?
>
> Thanks,
> Paolo

Hi Paolo,

in this case only the first one will be charged for the I/O activity
(the one that changes a page from clean to dirty). This is probably not
totally fair in some cases, but I think it's a good compromise; in the
end, rewriting the same page over and over while it's already dirty
doesn't actually generate I/O activity until the page is flushed back
to disk.
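
This is essentially what patch 3/3 does in the buffered write path
(simplified excerpt, see the actual diff for the details):

	/* generic_perform_write(), simplified from patch 3/3 */
	dirty = PageDirty(page);	/* was the page dirty before this write? */
	status = a_ops->write_end(file, mapping, pos, bytes, copied,
				  page, fsdata);
	...
	__balance_dirty_pages_ratelimited(mapping, dirty);

	/*
	 * __balance_dirty_pages_ratelimited(), simplified from patch 3/3:
	 * charge and throttle only if the page was clean before this write,
	 * since re-dirtying an already dirty page doesn't generate extra I/O.
	 */
	if (!redirty)
		fsio_throttle(bdev_to_dev(bdev), PAGE_SIZE, TASK_KILLABLE);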

Obviously I'm open to other better ideas and suggestions.

Thanks!
-Andrea

2019-01-18 11:15:19

by Paolo Valente

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller



> On 18 Jan 2019, at 12:10, Andrea Righi <[email protected]> wrote:
>
> On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>>
>>
>>> On 18 Jan 2019, at 11:31, Andrea Righi <[email protected]> wrote:
>>>
>>> This is a redesign of my old cgroup-io-throttle controller:
>>> https://lwn.net/Articles/330531/
>>>
>>> I'm resuming this old patch to point out a problem that I think is still
>>> not solved completely.
>>>
>>> = Problem =
>>>
>>> The io.max controller works really well at limiting synchronous I/O
>>> (READs), but a lot of I/O requests are initiated outside the context of
>>> the process that is ultimately responsible for its creation (e.g.,
>>> WRITEs).
>>>
>>> Throttling at the block layer in some cases is too late and we may end
>>> up slowing down processes that are not responsible for the I/O that
>>> is being processed at that level.
>>>
>>> = Proposed solution =
>>>
>>> The main idea of this controller is to split I/O measurement and I/O
>>> throttling: I/O is measured at the block layer for READS, at page cache
>>> (dirty pages) for WRITEs, and processes are limited while they're
>>> generating I/O at the VFS level, based on the measured I/O.
>>>
>>
>> Hi Andrea,
>> what the about the case where two processes are dirtying the same
>> pages? Which will be charged?
>>
>> Thanks,
>> Paolo
>
> Hi Paolo,
>
> in this case only the first one will be charged for the I/O activity
> (the one that changes a page from clean to dirty). This is probably not
> totally fair in some cases, but I think it's a good compromise,

Absolutely, I just wanted to better understand this point.

> at the
> end rewriting the same page over and over while it's already dirty
> doesn't actually generate I/O activity, until the page is flushed back
> to disk.
>

Right.

Thanks,
Paolo

> Obviously I'm open to other better ideas and suggestions.
>
> Thanks!
> -Andrea


2019-01-18 16:37:12

by Josef Bacik

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
>
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
>
> = Problem =
>
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
>
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.

How so? The writeback threads are per-cgroup and have the cgroup stuff set
properly. So if you dirty a bunch of pages, they are associated with your
cgroup, and then writeback happens and it's done in the writeback thread
associated with your cgroup and then that is throttled. Then you are throttled
at balance_dirty_pages() because the writeout is taking longer.

I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
clearly tie IO to the thing generating the IO, such as readahead and such. If
you are running into this case that may be something worth using. Course it
only works for io.latency now but there's no reason you can't add support to it
for io.max or whatever.

>
> = Proposed solution =
>
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
>

This is what blk_cgroup_congested() is meant to accomplish, I would suggest
looking into that route and simply changing the existing io controller you are
using to take advantage of that so it will actually throttle things. Then just
sprinkle it around the areas where we indirectly generate IO. Thanks,

Josef

2019-01-18 17:09:53

by Paolo Valente

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller



> On 18 Jan 2019, at 17:35, Josef Bacik <[email protected]> wrote:
>
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>>
>> I'm resuming this old patch to point out a problem that I think is still
>> not solved completely.
>>
>> = Problem =
>>
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context of
>> the process that is ultimately responsible for its creation (e.g.,
>> WRITEs).
>>
>> Throttling at the block layer in some cases is too late and we may end
>> up slowing down processes that are not responsible for the I/O that
>> is being processed at that level.
>
> How so? The writeback threads are per-cgroup and have the cgroup stuff set
> properly. So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled. Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.
>

IIUC, Andrea described this problem: certain processes in a certain group dirty a
lot of pages, causing writeback to start. Then some other blameless
process in the same group experiences very high latency, in spite of
the fact that it has to do little I/O.

Does your blk_cgroup_congested() stuff solve this issue?

Or simply I didn't get what Andrea meant at all :)

Thanks,
Paolo

> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such. If
> you are running into this case that may be something worth using. Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.
>
>>
>> = Proposed solution =
>>
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READS, at page cache
>> (dirty pages) for WRITEs, and processes are limited while they're
>> generating I/O at the VFS level, based on the measured I/O.
>>
>
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things. Then just
> sprinkle it around the areas where we indirectly generate IO. Thanks,
>
> Josef


2019-01-18 17:14:27

by Josef Bacik

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
>
>
> > On 18 Jan 2019, at 17:35, Josef Bacik <[email protected]> wrote:
> >
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >>
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >>
> >> = Problem =
> >>
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >>
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> >
> > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > properly. So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled. Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> >
>
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start. Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
>

In that case the io controller isn't doing its job and needs to be fixed (or
reconfigured). io.latency guards against this; I assume io.max would keep this
from happening if it were configured properly.

> Does your blk_cgroup_congested() stuff solves this issue?
>
> Or simply I didn't get what Andrea meant at all :)
>

I _think_ Andrea is talking about the fact that we can generate IO indirectly
and never get throttled for it, which is what blk_cgroup_congested() is meant to
address. I added it specifically because some low prio task was just allocating
all of the memory on the system and causing a lot of pressure because of
swapping, but there was no direct feedback loop there. blk_cgroup_congested()
provides that feedback loop.

Course I could be wrong too and we're all just talking past each other ;).
Thanks,

Josef

2019-01-18 18:45:45

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> >
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> >
> > = Problem =
> >
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> >
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
>
> How so? The writeback threads are per-cgroup and have the cgroup stuff set
> properly. So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled. Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.

Right, writeback is per-cgroup and slowing down writeback affects only
that specific cgroup, but there are cases where other processes from
other cgroups may need to wait on that writeback to complete before
doing I/O (for example an fsync() to a file shared among different
cgroups). In this case we may end up blocking cgroups that shouldn't be
blocked, which looks like a priority-inversion problem. This is the
problem that I'm trying to address.
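
To make it more concrete, here is a trivial userspace sketch of the
scenario (file path, sizes and the cgroup setup are just an example):

/*
 * Shared-file fsync scenario (sketch): the child is supposed to run in a
 * heavily write-throttled cgroup, the parent in an unlimited one.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/shared-file", O_RDWR | O_CREAT, 0644);

	memset(buf, 0, sizeof(buf));
	if (fork() == 0) {
		/* child: move itself into the throttled cgroup (omitted),
		 * then keep dirtying the shared file */
		for (int i = 0; i < (1 << 16); i++)
			write(fd, buf, sizeof(buf));
		_exit(0);
	}
	sleep(5);
	write(fd, buf, sizeof(buf));
	/* the unlimited parent now has to wait for the child's throttled
	 * writeback as well */
	fsync(fd);
	return 0;
}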

>
> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such. If
> you are running into this case that may be something worth using. Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.

IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
memcg), something like this: if the cgroup is already congested don't
generate extra I/O due to readahead. Am I right?
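
I mean something along these lines (hypothetical placement, just to make
sure I understand the intended usage):

	/* somewhere in a readahead-like path (hypothetical) */
	if (blk_cgroup_congested()) {
		/*
		 * The current task's cgroup is already congested at the
		 * block layer: skip the readahead and only read what was
		 * explicitly requested, instead of generating extra I/O
		 * on its behalf.
		 */
		return;
	}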

>
> >
> > = Proposed solution =
> >
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> >
>
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things. Then just
> sprinkle it around the areas where we indirectly generate IO. Thanks,

Absolutely, I can probably use blk_cgroup_congested() as a method to
determine when a cgroup should be throttled (instead of doing my own
I/O measuring), but to prevent the "slow writeback slowing down other
cgroups" issue I still need to apply throttling when pages are dirtied
in page cache.

Thanks,
-Andrea

2019-01-18 19:05:30

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
>
>
> > On 18 Jan 2019, at 17:35, Josef Bacik <[email protected]> wrote:
> >
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >>
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >>
> >> = Problem =
> >>
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >>
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> >
> > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > properly. So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled. Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> >
>
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start. Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
>
> Does your blk_cgroup_congested() stuff solves this issue?
>
> Or simply I didn't get what Andrea meant at all :)
>
> Thanks,
> Paolo

Yes, there is also this problem: providing fairness among processes
running inside the same cgroup.

This is a tough one, because once the I/O limit is reached, whichever
process comes next gets punished, even if it hasn't done any I/O before.

Well, my proposal doesn't solve this problem. To solve this one in the
"throttling" scenario, we should probably save some information about
the previously generated I/O activity and apply a delay proportional to
that (like a dynamic weight for each process inside each cgroup).

AFAICS the io.max controller doesn't solve this problem either.

-Andrea

2019-01-18 19:48:42

by Josef Bacik

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > This is a redesign of my old cgroup-io-throttle controller:
> > > https://lwn.net/Articles/330531/
> > >
> > > I'm resuming this old patch to point out a problem that I think is still
> > > not solved completely.
> > >
> > > = Problem =
> > >
> > > The io.max controller works really well at limiting synchronous I/O
> > > (READs), but a lot of I/O requests are initiated outside the context of
> > > the process that is ultimately responsible for its creation (e.g.,
> > > WRITEs).
> > >
> > > Throttling at the block layer in some cases is too late and we may end
> > > up slowing down processes that are not responsible for the I/O that
> > > is being processed at that level.
> >
> > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > properly. So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled. Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
>
> Right, writeback is per-cgroup and slowing down writeback affects only
> that specific cgroup, but, there are cases where other processes from
> other cgroups may require to wait on that writeback to complete before
> doing I/O (for example an fsync() to a file shared among different
> cgroups). In this case we may end up blocking cgroups that shouldn't be
> blocked, that looks like a priority-inversion problem. This is the
> problem that I'm trying to address.

Well this case is a misconfiguration: you shouldn't be sharing files between
cgroups. But even if you are, fsync() is synchronous; we should be getting the
context from the process itself and thus should have its own rules applied.
There's nothing we can do for outstanding IO, but that shouldn't be that much.
That would need to be dealt with on a per-controller basis.

>
> >
> > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > clearly tie IO to the thing generating the IO, such as readahead and such. If
> > you are running into this case that may be something worth using. Course it
> > only works for io.latency now but there's no reason you can't add support to it
> > for io.max or whatever.
>
> IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> memcg), something like this: if the cgroup is already congested don't
> generate extra I/O due to readahead. Am I right?

Yeah, but that's just how it's currently used, it can be used any which way we
feel like.

>
> >
> > >
> > > = Proposed solution =
> > >
> > > The main idea of this controller is to split I/O measurement and I/O
> > > throttling: I/O is measured at the block layer for READS, at page cache
> > > (dirty pages) for WRITEs, and processes are limited while they're
> > > generating I/O at the VFS level, based on the measured I/O.
> > >
> >
> > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > looking into that route and simply changing the existing io controller you are
> > using to take advantage of that so it will actually throttle things. Then just
> > sprinkle it around the areas where we indirectly generate IO. Thanks,
>
> Absolutely, I can probably use blk_cgroup_congested() as a method to
> determine when a cgroup should be throttled (instead of doing my own
> I/O measuring), but to prevent the "slow writeback slowing down other
> cgroups" issue I still need to apply throttling when pages are dirtied
> in page cache.

Again this is just a fuckup from a configuration standpoint. The argument
could be made that sync() is probably broken here, but I think the right
solution here is to just pass the cgroup context along with the writeback
information and use that if it's set instead. Thanks,

Josef

2019-01-19 10:10:58

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > >
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > >
> > > > = Problem =
> > > >
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > >
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > >
> > > How so? The writeback threads are per-cgroup and have the cgroup stuff set
> > > properly. So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled. Then you are throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> >
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
>
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups. But even if you are, fsync() is synchronous, we should be getting the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-contoller basis.

OK, fair point. We shouldn't be sharing files between cgroups.

I'm still not sure if we can have similar issues with metadata I/O (that
may introduce latencies like the sync() scenario); I have to investigate
more and do more tests.

>
> >
> > >
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and such. If
> > > you are running into this case that may be something worth using. Course it
> > > only works for io.latency now but there's no reason you can't add support to it
> > > for io.max or whatever.
> >
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
>
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.

I think it'd be very interesting to have the possibility to throttle I/O
either before writeback or during writeback. Right now we can only
throttle writeback. Maybe we can try to introduce some kind of dirty
page throttling controller using blk_cgroup_congested()... Opinions?
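
Something like this, maybe (hand-wavy sketch, just to explain what I mean):

	/*
	 * In the page-dirtying path: if the current task's cgroup is
	 * already congested at the block layer, make the dirtier back off
	 * now, before it generates more dirty pages, instead of relying
	 * only on writeback pushing back on it later.
	 */
	if (blk_cgroup_congested())
		congestion_wait(BLK_RW_ASYNC, HZ / 10);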

>
> >
> > >
> > > >
> > > > = Proposed solution =
> > > >
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > >
> > >
> > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > > looking into that route and simply changing the existing io controller you are
> > > using to take advantage of that so it will actually throttle things. Then just
> > > sprinkle it around the areas where we indirectly generate IO. Thanks,
> >
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> > determine when a cgroup should be throttled (instead of doing my own
> > I/O measuring), but to prevent the "slow writeback slowing down other
> > cgroups" issue I still need to apply throttling when pages are dirtied
> > in page cache.
>
> Again this is just a fuckup from a configuration stand point. The argument
> could be made that sync() is probably broken here, but I think the right
> solution here is to just pass the cgroup context along with the writeback
> information and use that if it's set instead. Thanks,

Alright, let's skip the root cgroup for now. I think the point here is
whether we want to provide sync() isolation among cgroups or not.

According to the manpage:

sync() causes all pending modifications to filesystem metadata and cached file data to be
written to the underlying filesystems.

And:
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
may return before the actual writing is done. However Linux waits for I/O completions, and
thus sync() or syncfs() provide the same guarantees as fsync called on every
file in the system or filesystem respectively.

Excluding the root cgroup, do you think a sync() issued inside a
specific cgroup should wait for I/O completions only for the writes that
have been generated by that cgroup?

Thanks,
-Andrea

2019-01-21 21:48:37

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:

[..]
> Alright, let's skip the root cgroup for now. I think the point here is
> if we want to provide sync() isolation among cgroups or not.
>
> According to the manpage:
>
> sync() causes all pending modifications to filesystem metadata and cached file data to be
> written to the underlying filesystems.
>
> And:
> According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> may return before the actual writing is done. However Linux waits for I/O completions, and
> thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
> tem or filesystem respectively.
>
> Excluding the root cgroup, do you think a sync() issued inside a
> specific cgroup should wait for I/O completions only for the writes that
> have been generated by that cgroup?

Can we account I/O towards the cgroup which issued "sync" only if the write
rate of the sync cgroup is higher than that of the cgroup to which the page
belongs? Will that solve the problem, assuming it's doable?

Vivek

2019-01-28 17:42:53

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

Hi Vivek,

sorry for the late reply.

On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
>
> [..]
> > Alright, let's skip the root cgroup for now. I think the point here is
> > if we want to provide sync() isolation among cgroups or not.
> >
> > According to the manpage:
> >
> > sync() causes all pending modifications to filesystem metadata and cached file data to be
> > written to the underlying filesystems.
> >
> > And:
> > According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > may return before the actual writing is done. However Linux waits for I/O completions, and
> > thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
> > tem or filesystem respectively.
> >
> > Excluding the root cgroup, do you think a sync() issued inside a
> > specific cgroup should wait for I/O completions only for the writes that
> > have been generated by that cgroup?
>
> Can we account I/O towards the cgroup which issued "sync" only if write
> rate of sync cgroup is higher than cgroup to which page belongs to. Will
> that solve problem, assuming its doable?

Maybe this would mitigate the problem, in part, but it doesn't solve it.

The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
issues "sync", the fast cgroup needs to wait a lot, because writeback is
happening at the speed of the slow cgroup.

Ideally in this case we should bump up the writeback speed, maybe even
temporarily inherit the write rate of the sync cgroup, similarly to a
priority-inversion locking scenario, but I think it's not doable at the
moment without applying big changes.

Or we could isolate the sync domain, meaning that a cgroup issuing a
sync will only wait for the syncing of the pages that belong to that
sync cgroup. But this method probably also requires big changes...

-Andrea

2019-01-28 19:27:25

by Vivek Goyal

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> Hi Vivek,
>
> sorry for the late reply.
>
> On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> >
> > [..]
> > > Alright, let's skip the root cgroup for now. I think the point here is
> > > if we want to provide sync() isolation among cgroups or not.
> > >
> > > According to the manpage:
> > >
> > > sync() causes all pending modifications to filesystem metadata and cached file data to be
> > > written to the underlying filesystems.
> > >
> > > And:
> > > According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > may return before the actual writing is done. However Linux waits for I/O completions, and
> > > thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
> > > tem or filesystem respectively.
> > >
> > > Excluding the root cgroup, do you think a sync() issued inside a
> > > specific cgroup should wait for I/O completions only for the writes that
> > > have been generated by that cgroup?
> >
> > Can we account I/O towards the cgroup which issued "sync" only if write
> > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > that solve problem, assuming its doable?
>
> Maybe this would mitigate the problem, in part, but it doesn't solve it.
>
> The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> issues "sync", the fast cgroup needs to wait a lot, because writeback is
> happening at the speed of the slow cgroup.

Hi Andrea,

But that's true only for I/O which has already been submitted to the block
layer, right? Any new I/O yet to be submitted could still be attributed
to the faster cgroup requesting the sync.

Until and unless cgroups limits are absurdly low, it should not take very
long for already submitted I/O to finish. If yes, then in practice, it
might not be a big problem?

Vivek

>
> Ideally in this case we should bump up the writeback speed, maybe even
> temporarily inherit the write rate of the sync cgroup, similarly to a
> priority-inversion locking scenario, but I think it's not doable at the
> moment without applying big changes.
>
> Or we could isolate the sync domain, meaning that a cgroup issuing a
> sync will only wait for the syncing of the pages that belong to that
> sync cgroup. But probably also this method requires big changes...
>
> -Andrea

2019-01-29 18:41:13

by Andrea Righi

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > Hi Vivek,
> >
> > sorry for the late reply.
> >
> > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > >
> > > [..]
> > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > if we want to provide sync() isolation among cgroups or not.
> > > >
> > > > According to the manpage:
> > > >
> > > > sync() causes all pending modifications to filesystem metadata and cached file data to be
> > > > written to the underlying filesystems.
> > > >
> > > > And:
> > > > According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > > may return before the actual writing is done. However Linux waits for I/O completions, and
> > > > thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
> > > > tem or filesystem respectively.
> > > >
> > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > specific cgroup should wait for I/O completions only for the writes that
> > > > have been generated by that cgroup?
> > >
> > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > that solve problem, assuming its doable?
> >
> > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> >
> > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > happening at the speed of the slow cgroup.
>
> Hi Andrea,
>
> But that's true only for I/O which has already been submitted to block
> layer, right? Any new I/O yet to be submitted could still be attributed
> to faster cgroup requesting sync.

Right. If we could bump up the new I/O yet to be submitted I think we
could effectively prevent the priority inversion problem (the ongoing
writeback I/O should be negligible).

>
> Until and unless cgroups limits are absurdly low, it should not take very
> long for already submitted I/O to finish. If yes, then in practice, it
> might not be a big problem?

I was actually doing my tests with a very low limit (1MB/s both for rbps
and wbps), but this shows the problem very well I think.

Here's what I'm doing:

[ slow cgroup (1Mbps read/write) ]

$ cat /sys/fs/cgroup/unified/cg1/io.max
259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
$ cat /proc/self/cgroup
0::/cg1

$ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30

[ fast cgroup (root cgroup, no limitation) ]

# cat /proc/self/cgroup
0::/

# time sync
real 9m32,618s
user 0m0,000s
sys 0m0,018s

With this simple test I can easily trigger hung task timeout warnings
and make the whole system totally sluggish (even the processes running
in the root cgroup).

When fio ends, writeback is still taking forever to complete, as you can
see from the insane amount of time that sync takes to complete.

-Andrea

2019-01-29 18:51:59

by Josef Bacik

Subject: Re: [RFC PATCH 0/3] cgroup: fsio throttle controller

On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote:
> On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > > Hi Vivek,
> > >
> > > sorry for the late reply.
> > >
> > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > >
> > > > [..]
> > > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > > if we want to provide sync() isolation among cgroups or not.
> > > > >
> > > > > According to the manpage:
> > > > >
> > > > > sync() causes all pending modifications to filesystem metadata and cached file data to be
> > > > > written to the underlying filesystems.
> > > > >
> > > > > And:
> > > > > According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > > > may return before the actual writing is done. However Linux waits for I/O completions, and
> > > > > thus sync() or syncfs() provide the same guarantees as fsync called on every file in the sys‐
> > > > > tem or filesystem respectively.
> > > > >
> > > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > > specific cgroup should wait for I/O completions only for the writes that
> > > > > have been generated by that cgroup?
> > > >
> > > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > > that solve problem, assuming its doable?
> > >
> > > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > >
> > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > > happening at the speed of the slow cgroup.
> >
> > Hi Andrea,
> >
> > But that's true only for I/O which has already been submitted to block
> > layer, right? Any new I/O yet to be submitted could still be attributed
> > to faster cgroup requesting sync.
>
> Right. If we could bump up the new I/O yet to be submitted I think we
> could effectively prevent the priority inversion problem (the ongoing
> writeback I/O should be negligible).
>
> >
> > Until and unless cgroups limits are absurdly low, it should not take very
> > long for already submitted I/O to finish. If yes, then in practice, it
> > might not be a big problem?
>
> I was actually doing my tests with a very low limit (1MB/s both for rbps
> and wbps), but this shows the problem very well I think.
>
> Here's what I'm doing:
>
> [ slow cgroup (1Mbps read/write) ]
>
> $ cat /sys/fs/cgroup/unified/cg1/io.max
> 259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
> $ cat /proc/self/cgroup
> 0::/cg1
>
> $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30
>
> [ fast cgroup (root cgroup, no limitation) ]
>
> # cat /proc/self/cgroup
> 0::/
>
> # time sync
> real 9m32,618s
> user 0m0,000s
> sys 0m0,018s
>
> With this simple test I can easily trigger hung task timeout warnings
> and make the whole system totally sluggish (even the processes running
> in the root cgroup).
>
> When fio ends, writeback is still taking forever to complete, as you can
> see by the insane amount that sync takes to complete.
>

Yeah sync() needs to be treated differently, but it's kind of special too. We
don't want the slow group to run sync() and back up the fast group doing
sync(), because we make all of the io go based on the submitting cgroup. The
problem here is we don't know who's more important until we get to the blk
cgroup layer, and even then sometimes we can't tell (different hierarchies
would make this tricky with io.weight or io.latency).

We could treat it like REQ_META and just let everything go through and back
charge. This feels like a way for the slow group to cheat though, unless we
just throttle the shit out of him before returning to user space. I'll have to
think about this some more. Thanks,

Josef