2023-04-26 08:24:02

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 0/7] limit the number of plugged bio

From: Yu Kuai <[email protected]>

Changes in v2:
- remove the patch to rename raid1-10.c

This patchset tries to limit the number of plugged bios for raid1 and
raid10, which is done in the last patch; the other patches are refactoring
and optimizations.

This patchset is tested with a new test [1]: the test triggers dirty page
writeback for 10s, and in the meantime checks disk inflight.

Before this patchset, the test fails because inflight exceeds the threshold
(the threshold is set to 4096 in the test; in theory it can be much greater
as long as there are enough dirty pages and memory).

After this patchset, inflight stays within 96 (MAX_PLUG_BIO * copies).
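(As a worked check of that bound: MAX_PLUG_BIO is 32 in the last patch, so
96 corresponds to 3 copies in the test array; the plug is flushed once
32 * 3 = 96 member bios are queued, so at most 96 bios from the plug are
in flight at a time.)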

[1] https://lore.kernel.org/linux-raid/[email protected]/

Yu Kuai (7):
md/raid10: prevent soft lockup while flush writes
md/raid1-10: factor out a helper to add bio to plug
md/raid1-10: factor out a helper to submit normal write
md/raid1-10: submit write io directly if bitmap is not enabled
md/md-bitmap: add a new helper to unplug bitmap asynchronously
md/raid1-10: don't handle plugged bio by daemon thread
md/raid1-10: limit the number of plugged bio

drivers/md/md-bitmap.c | 55 +++++++++++++++++++++++++++++++++----
drivers/md/md-bitmap.h | 10 +++++++
drivers/md/raid1-10.c | 62 ++++++++++++++++++++++++++++++++++++++++++
drivers/md/raid1.c | 29 ++++----------------
drivers/md/raid10.c | 47 +++++++-------------------------
5 files changed, 136 insertions(+), 67 deletions(-)

--
2.39.2


2023-04-26 08:24:32

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 1/7] md/raid10: prevent soft lockup while flush writes

From: Yu Kuai <[email protected]>

Currently, there is no limit on raid1/raid10 plugged bios. While flushing
writes, raid1 calls cond_resched() while raid10 doesn't, so too many
writes can cause a soft lockup.

The following soft lockup can be triggered easily with a writeback test
for raid10 on ramdisks:

watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
<TASK>
call_rcu+0x16/0x20
put_object+0x41/0x80
__delete_object+0x50/0x90
delete_object_full+0x2b/0x40
kmemleak_free+0x46/0xa0
slab_free_freelist_hook.constprop.0+0xed/0x1a0
kmem_cache_free+0xfd/0x300
mempool_free_slab+0x1f/0x30
mempool_free+0x3a/0x100
bio_free+0x59/0x80
bio_put+0xcf/0x2c0
free_r10bio+0xbf/0xf0
raid_end_bio_io+0x78/0xb0
one_write_done+0x8a/0xa0
raid10_end_write_request+0x1b4/0x430
bio_endio+0x175/0x320
brd_submit_bio+0x3b9/0x9b7 [brd]
__submit_bio+0x69/0xe0
submit_bio_noacct_nocheck+0x1e6/0x5a0
submit_bio_noacct+0x38c/0x7e0
flush_pending_writes+0xf0/0x240
raid10d+0xac/0x1ed0

Fix the problem by adding cond_resched() to raid10, as raid1 already does.

Note that unlimited plugged bios still need to be optimized: for example,
when lots of dirty pages are written back, this takes lots of memory and
io latency is quite bad.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid10.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 32fb4ff0acdb..6b31f848a6d9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -921,6 +921,7 @@ static void flush_pending_writes(struct r10conf *conf)
else
submit_bio_noacct(bio);
bio = next;
+ cond_resched();
}
blk_finish_plug(&plug);
} else
@@ -1145,6 +1146,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
else
submit_bio_noacct(bio);
bio = next;
+ cond_resched();
}
kfree(plug);
}
--
2.39.2

2023-04-26 08:24:50

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 2/7] md/raid1-10: factor out a helper to add bio to plug

From: Yu Kuai <[email protected]>

The code in raid1 and raid10 is identical; this prepares to limit the
number of plugged bios.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1-10.c | 16 ++++++++++++++++
drivers/md/raid1.c | 12 +-----------
drivers/md/raid10.c | 11 +----------
3 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index e61f6cad4e08..5ba0c158ccd7 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -109,3 +109,19 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
size -= len;
} while (idx++ < RESYNC_PAGES && size > 0);
}
+
+static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
+ blk_plug_cb_fn unplug)
+{
+ struct raid1_plug_cb *plug = NULL;
+ struct blk_plug_cb *cb = blk_check_plugged(unplug, mddev,
+ sizeof(*plug));
+
+ if (!cb)
+ return false;
+
+ plug = container_of(cb, struct raid1_plug_cb, cb);
+ bio_list_add(&plug->pending, bio);
+
+ return true;
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2f1011ffdf09..92083f5329f9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1343,8 +1343,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
struct bitmap *bitmap = mddev->bitmap;
unsigned long flags;
struct md_rdev *blocked_rdev;
- struct blk_plug_cb *cb;
- struct raid1_plug_cb *plug = NULL;
int first_clone;
int max_sectors;
bool write_behind = false;
@@ -1573,15 +1571,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
r1_bio->sector);
/* flush_pending_writes() needs access to the rdev so...*/
mbio->bi_bdev = (void *)rdev;
-
- cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
- if (cb)
- plug = container_of(cb, struct raid1_plug_cb, cb);
- else
- plug = NULL;
- if (plug) {
- bio_list_add(&plug->pending, mbio);
- } else {
+ if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
spin_unlock_irqrestore(&conf->device_lock, flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6b31f848a6d9..2a2e138fa804 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1287,8 +1287,6 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
const blk_opf_t do_sync = bio->bi_opf & REQ_SYNC;
const blk_opf_t do_fua = bio->bi_opf & REQ_FUA;
unsigned long flags;
- struct blk_plug_cb *cb;
- struct raid1_plug_cb *plug = NULL;
struct r10conf *conf = mddev->private;
struct md_rdev *rdev;
int devnum = r10_bio->devs[n_copy].devnum;
@@ -1328,14 +1326,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,

atomic_inc(&r10_bio->remaining);

- cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
- if (cb)
- plug = container_of(cb, struct raid1_plug_cb, cb);
- else
- plug = NULL;
- if (plug) {
- bio_list_add(&plug->pending, mbio);
- } else {
+ if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
spin_unlock_irqrestore(&conf->device_lock, flags);
--
2.39.2

2023-04-26 08:25:18

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

From: Yu Kuai <[email protected]>

bios can be added to the plug without limit, and the following writeback
test can trigger a huge amount of plugged bios:

Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
echo 0 > /proc/sys/vm/dirty_background_ratio
echo 60 > /proc/sys/vm/dirty_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test

Test result:
Monitoring /sys/block/md0/inflight shows that inflight keeps increasing
until fio finishes writing; after running for about 2 minutes:

[root@fedora ~]# cat /sys/block/md0/inflight
0 4474191

Fix the problem by limiting the number of plugged bios based on the number
of copies for the original bio.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1-10.c | 9 ++++++++-
drivers/md/raid1.c | 2 +-
drivers/md/raid10.c | 2 +-
3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 98d678b7df3f..35fb80aa37aa 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -21,6 +21,7 @@
#define IO_MADE_GOOD ((struct bio *)2)

#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
+#define MAX_PLUG_BIO 32

/* for managing resync I/O pages */
struct resync_pages {
@@ -31,6 +32,7 @@ struct resync_pages {
struct raid1_plug_cb {
struct blk_plug_cb cb;
struct bio_list pending;
+ unsigned int count;
};

static void rbio_pool_free(void *rbio, void *data)
@@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
}

static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
- blk_plug_cb_fn unplug)
+ blk_plug_cb_fn unplug, int copies)
{
struct raid1_plug_cb *plug = NULL;
struct blk_plug_cb *cb;
@@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,

plug = container_of(cb, struct raid1_plug_cb, cb);
bio_list_add(&plug->pending, bio);
+ if (++plug->count / MAX_PLUG_BIO >= copies) {
+ list_del(&cb->list);
+ cb->callback(cb, false);
+ }
+

return true;
}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 639e09cecf01..c6066408a913 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
r1_bio->sector);
/* flush_pending_writes() needs access to the rdev so...*/
mbio->bi_bdev = (void *)rdev;
- if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
+ if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
spin_unlock_irqrestore(&conf->device_lock, flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index bd9e655ca408..7135cfaf75db 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,

atomic_inc(&r10_bio->remaining);

- if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
+ if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
spin_lock_irqsave(&conf->device_lock, flags);
bio_list_add(&conf->pending_bio_list, mbio);
spin_unlock_irqrestore(&conf->device_lock, flags);
--
2.39.2

2023-04-26 08:25:17

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 4/7] md/raid1-10: submit write io directly if bitmap is not enabled

From: Yu Kuai <[email protected]>

Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
added bitmap support, and changed write io to be submitted through the
daemon thread, because the bitmap needs to be updated before the write io.
Later, plugging was added to fix the performance regression, because
routing all write io through the daemon thread means io can't be issued
concurrently.

However, if the bitmap is not enabled, the write io should not go to the
daemon thread in the first place, and the plug is not needed either.

Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-bitmap.c | 4 +---
drivers/md/md-bitmap.h | 7 +++++++
drivers/md/raid1-10.c | 13 +++++++++++--
3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 2a21633c5917..17e988f88303 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1023,7 +1023,6 @@ static int md_bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
return set;
}

-
/* this gets called when the md device is ready to unplug its underlying
* (slave) device queues -- before we let any writes go down, we need to
* sync the dirty pages of the bitmap file to disk */
@@ -1033,8 +1032,7 @@ void md_bitmap_unplug(struct bitmap *bitmap)
int dirty, need_write;
int writing = 0;

- if (!bitmap || !bitmap->storage.filemap ||
- test_bit(BITMAP_STALE, &bitmap->flags))
+ if (!md_bitmap_enabled(bitmap))
return;

/* look at each page to see if there are any set bits that need to be
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index cfd7395de8fd..3a4750952b3a 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -273,6 +273,13 @@ int md_bitmap_copy_from_slot(struct mddev *mddev, int slot,
sector_t *lo, sector_t *hi, bool clear_bits);
void md_bitmap_free(struct bitmap *bitmap);
void md_bitmap_wait_behind_writes(struct mddev *mddev);
+
+static inline bool md_bitmap_enabled(struct bitmap *bitmap)
+{
+ return bitmap && bitmap->storage.filemap &&
+ !test_bit(BITMAP_STALE, &bitmap->flags);
+}
+
#endif

#endif
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 25d55036a6fb..4167bda139b1 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -130,9 +130,18 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
blk_plug_cb_fn unplug)
{
struct raid1_plug_cb *plug = NULL;
- struct blk_plug_cb *cb = blk_check_plugged(unplug, mddev,
- sizeof(*plug));
+ struct blk_plug_cb *cb;
+
+ /*
+ * If bitmap is not enabled, it's safe to submit the io directly, and
+ * this can get optimal performance.
+ */
+ if (!md_bitmap_enabled(mddev->bitmap)) {
+ md_submit_write(bio);
+ return true;
+ }

+ cb = blk_check_plugged(unplug, mddev, sizeof(*plug));
if (!cb)
return false;

--
2.39.2

2023-04-26 08:25:51

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 3/7] md/raid1-10: factor out a helper to submit normal write

From: Yu Kuai <[email protected]>

There are multiple places that do the same thing; factor out a helper to
prevent redundant code. The helper will also be used in a following patch.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1-10.c | 16 ++++++++++++++++
drivers/md/raid1.c | 13 ++-----------
drivers/md/raid10.c | 26 ++++----------------------
3 files changed, 22 insertions(+), 33 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 5ba0c158ccd7..25d55036a6fb 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -110,6 +110,22 @@ static void md_bio_reset_resync_pages(struct bio *bio, struct resync_pages *rp,
} while (idx++ < RESYNC_PAGES && size > 0);
}

+static inline void md_submit_write(struct bio *bio)
+{
+ struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
+
+ bio->bi_next = NULL;
+ bio_set_dev(bio, rdev->bdev);
+ if (test_bit(Faulty, &rdev->flags))
+ bio_io_error(bio);
+ else if (unlikely(bio_op(bio) == REQ_OP_DISCARD &&
+ !bdev_max_discard_sectors(bio->bi_bdev)))
+ /* Just ignore it */
+ bio_endio(bio);
+ else
+ submit_bio_noacct(bio);
+}
+
static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
blk_plug_cb_fn unplug)
{
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 92083f5329f9..d79525d6ec8f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -799,17 +799,8 @@ static void flush_bio_list(struct r1conf *conf, struct bio *bio)

while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
- struct md_rdev *rdev = (void *)bio->bi_bdev;
- bio->bi_next = NULL;
- bio_set_dev(bio, rdev->bdev);
- if (test_bit(Faulty, &rdev->flags)) {
- bio_io_error(bio);
- } else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
- !bdev_max_discard_sectors(bio->bi_bdev)))
- /* Just ignore it */
- bio_endio(bio);
- else
- submit_bio_noacct(bio);
+
+ md_submit_write(bio);
bio = next;
cond_resched();
}
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2a2e138fa804..626cb8ed2099 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -909,17 +909,8 @@ static void flush_pending_writes(struct r10conf *conf)

while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
- struct md_rdev *rdev = (void*)bio->bi_bdev;
- bio->bi_next = NULL;
- bio_set_dev(bio, rdev->bdev);
- if (test_bit(Faulty, &rdev->flags)) {
- bio_io_error(bio);
- } else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
- !bdev_max_discard_sectors(bio->bi_bdev)))
- /* Just ignore it */
- bio_endio(bio);
- else
- submit_bio_noacct(bio);
+
+ md_submit_write(bio);
bio = next;
cond_resched();
}
@@ -1134,17 +1125,8 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)

while (bio) { /* submit pending writes */
struct bio *next = bio->bi_next;
- struct md_rdev *rdev = (void*)bio->bi_bdev;
- bio->bi_next = NULL;
- bio_set_dev(bio, rdev->bdev);
- if (test_bit(Faulty, &rdev->flags)) {
- bio_io_error(bio);
- } else if (unlikely((bio_op(bio) == REQ_OP_DISCARD) &&
- !bdev_max_discard_sectors(bio->bi_bdev)))
- /* Just ignore it */
- bio_endio(bio);
- else
- submit_bio_noacct(bio);
+
+ md_submit_write(bio);
bio = next;
cond_resched();
}
--
2.39.2

2023-04-26 08:25:52

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 5/7] md/md-bitmap: add a new helper to unplug bitmap asynchronously

From: Yu Kuai <[email protected]>

If the bitmap is enabled, the bitmap must be updated before submitting
write io; this is why the unplug callback must move these io to
'conf->pending_bio_list' if 'current->bio_list' is not empty, which suffers
performance degradation.

This patch adds a new helper md_bitmap_unplug_async() to submit bitmap io
from a kworker, so that submitting bitmap io in raid10_unplug() doesn't
require 'current->bio_list' to be empty.

This prepares to limit the number of plugged bios.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-bitmap.c | 51 +++++++++++++++++++++++++++++++++++++++---
drivers/md/md-bitmap.h | 3 +++
2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 17e988f88303..1afee157b96a 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1023,10 +1023,12 @@ static int md_bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
return set;
}

-/* this gets called when the md device is ready to unplug its underlying
+/*
+ * This gets called when the md device is ready to unplug its underlying
* (slave) device queues -- before we let any writes go down, we need to
- * sync the dirty pages of the bitmap file to disk */
-void md_bitmap_unplug(struct bitmap *bitmap)
+ * sync the dirty pages of the bitmap file to disk.
+ */
+static void md_do_bitmap_unplug(struct bitmap *bitmap)
{
unsigned long i;
int dirty, need_write;
@@ -1059,8 +1061,42 @@ void md_bitmap_unplug(struct bitmap *bitmap)
if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
md_bitmap_file_kick(bitmap);
}
+
+void md_bitmap_unplug(struct bitmap *bitmap)
+{
+ return md_do_bitmap_unplug(bitmap);
+}
EXPORT_SYMBOL(md_bitmap_unplug);

+struct bitmap_unplug_work {
+ struct work_struct work;
+ struct bitmap *bitmap;
+ struct completion *done;
+};
+
+static void md_bitmap_unplug_fn(struct work_struct *work)
+{
+ struct bitmap_unplug_work *unplug_work =
+ container_of(work, struct bitmap_unplug_work, work);
+
+ md_do_bitmap_unplug(unplug_work->bitmap);
+ complete(unplug_work->done);
+}
+
+void md_bitmap_unplug_async(struct bitmap *bitmap)
+{
+ DECLARE_COMPLETION_ONSTACK(done);
+ struct bitmap_unplug_work unplug_work;
+
+ INIT_WORK(&unplug_work.work, md_bitmap_unplug_fn);
+ unplug_work.bitmap = bitmap;
+ unplug_work.done = &done;
+
+ queue_work(bitmap->unplug_wq, &unplug_work.work);
+ wait_for_completion(&done);
+}
+EXPORT_SYMBOL(md_bitmap_unplug_async);
+
static void md_bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int needed);
/* * bitmap_init_from_disk -- called at bitmap_create time to initialize
* the in-memory bitmap from the on-disk bitmap -- also, sets up the
@@ -1776,6 +1812,9 @@ void md_bitmap_free(struct bitmap *bitmap)
if (!bitmap) /* there was no bitmap */
return;

+ if (bitmap->unplug_wq)
+ destroy_workqueue(bitmap->unplug_wq);
+
if (bitmap->sysfs_can_clear)
sysfs_put(bitmap->sysfs_can_clear);

@@ -1866,6 +1905,12 @@ struct bitmap *md_bitmap_create(struct mddev *mddev, int slot)
if (!bitmap)
return ERR_PTR(-ENOMEM);

+ bitmap->unplug_wq = create_workqueue("md_bitmap");
+ if (!bitmap->unplug_wq) {
+ err = -ENOMEM;
+ goto error;
+ }
+
spin_lock_init(&bitmap->counts.lock);
atomic_set(&bitmap->pending_writes, 0);
init_waitqueue_head(&bitmap->write_wait);
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 3a4750952b3a..55531669db24 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -231,6 +231,8 @@ struct bitmap {

struct kernfs_node *sysfs_can_clear;
int cluster_slot; /* Slot offset for clustered env */
+
+ struct workqueue_struct *unplug_wq;
};

/* the bitmap API */
@@ -264,6 +266,7 @@ void md_bitmap_sync_with_cluster(struct mddev *mddev,
sector_t new_lo, sector_t new_hi);

void md_bitmap_unplug(struct bitmap *bitmap);
+void md_bitmap_unplug_async(struct bitmap *bitmap);
void md_bitmap_daemon_work(struct mddev *mddev);

int md_bitmap_resize(struct bitmap *bitmap, sector_t blocks,
--
2.39.2

2023-04-26 08:27:35

by Yu Kuai

[permalink] [raw]
Subject: [PATCH -next v2 6/7] md/raid1-10: don't handle plugged bio by daemon thread

From: Yu Kuai <[email protected]>

current->bio_list will be set in the submit_bio() context; in this case
bitmap io will be added to the list and wait for the current io submission
to finish, while the current io submission must wait for the bitmap io to
be done. Commit 874807a83139 ("md/raid1{,0}: fix deadlock in
bitmap_unplug.") fixed the deadlock by handling plugged bios in the daemon
thread.

On the one hand, the deadlock no longer exists after commit a214b949d8e3
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On the
other hand, the current solution makes it impossible to flush plugged bios
in raid1/10_make_request(), because doing so would send all the writes to
the daemon thread.

In order to limit the number of plugged bios, commit 874807a83139
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid1-10.c | 14 ++++++++++++++
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 8 +++-----
3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 4167bda139b1..98d678b7df3f 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -150,3 +150,17 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,

return true;
}
+
+/*
+ * current->bio_list will be set under submit_bio() context, in this case bitmap
+ * io will be added to the list and wait for current io submission to finish,
+ * while current io submission must wait for bitmap io to be done. In order to
+ * avoid such deadlock, submit bitmap io asynchronously.
+ */
+static inline void md_prepare_flush_writes(struct bitmap *bitmap)
+{
+ if (current->bio_list)
+ md_bitmap_unplug_async(bitmap);
+ else
+ md_bitmap_unplug(bitmap);
+}
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d79525d6ec8f..639e09cecf01 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -794,7 +794,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
static void flush_bio_list(struct r1conf *conf, struct bio *bio)
{
/* flush any pending bitmap writes to disk before proceeding w/ I/O */
- md_bitmap_unplug(conf->mddev->bitmap);
+ md_prepare_flush_writes(conf->mddev->bitmap);
wake_up(&conf->wait_barrier);

while (bio) { /* submit pending writes */
@@ -1166,7 +1166,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
struct r1conf *conf = mddev->private;
struct bio *bio;

- if (from_schedule || current->bio_list) {
+ if (from_schedule) {
spin_lock_irq(&conf->device_lock);
bio_list_merge(&conf->pending_bio_list, &plug->pending);
spin_unlock_irq(&conf->device_lock);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 626cb8ed2099..bd9e655ca408 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -902,9 +902,7 @@ static void flush_pending_writes(struct r10conf *conf)
__set_current_state(TASK_RUNNING);

blk_start_plug(&plug);
- /* flush any pending bitmap writes to disk
- * before proceeding w/ I/O */
- md_bitmap_unplug(conf->mddev->bitmap);
+ md_prepare_flush_writes(conf->mddev->bitmap);
wake_up(&conf->wait_barrier);

while (bio) { /* submit pending writes */
@@ -1108,7 +1106,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
struct r10conf *conf = mddev->private;
struct bio *bio;

- if (from_schedule || current->bio_list) {
+ if (from_schedule) {
spin_lock_irq(&conf->device_lock);
bio_list_merge(&conf->pending_bio_list, &plug->pending);
spin_unlock_irq(&conf->device_lock);
@@ -1120,7 +1118,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)

/* we aren't scheduling, so we can do the write-out directly. */
bio = bio_list_get(&plug->pending);
- md_bitmap_unplug(mddev->bitmap);
+ md_prepare_flush_writes(mddev->bitmap);
wake_up(&conf->wait_barrier);

while (bio) { /* submit pending writes */
--
2.39.2

2023-05-12 10:09:09

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 0/7] limit the number of plugged bio

Hi,

On 2023/04/26 16:20, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Changes in v2:
> - remove the patch to rename raid1-10.c
>
> This patchset tries to limit the number of plugged bio for raid1 and
> raid10, which is done in the last patch, other patches are some refactor
> and optimizations.
>
> This patchset is tested with a new test [1], this test triggers dirty
> pages writeback for 10s, and in the meantime checks disk inflight.
>
> Before this patchset, test will fail because inflight exceed threshold
> (threshold is set to 4096 in the test, in theory this can be mutch
> greater as long as there are enough dirty pages and memory).
>
> After this patchset, inflight is within 96 (MAX_PLUG_BIO * copies).
>
> [1] https://lore.kernel.org/linux-raid/[email protected]/

Friendly ping...

Thanks,
Kuai
>
> Yu Kuai (7):
> md/raid10: prevent soft lockup while flush writes
> md/raid1-10: factor out a helper to add bio to plug
> md/raid1-10: factor out a helper to submit normal write
> md/raid1-10: submit write io directly if bitmap is not enabled
> md/md-bitmap: add a new helper to unplug bitmap asynchrously
> md/raid1-10: don't handle pluged bio by daemon thread
> md/raid1-10: limit the number of plugged bio
>
> drivers/md/md-bitmap.c | 55 +++++++++++++++++++++++++++++++++----
> drivers/md/md-bitmap.h | 10 +++++++
> drivers/md/raid1-10.c | 62 ++++++++++++++++++++++++++++++++++++++++++
> drivers/md/raid1.c | 29 ++++----------------
> drivers/md/raid10.c | 47 +++++++-------------------------
> 5 files changed, 136 insertions(+), 67 deletions(-)
>


2023-05-13 02:08:22

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH -next v2 0/7] limit the number of plugged bio

On Fri, May 12, 2023 at 2:43 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> > On 2023/04/26 16:20, Yu Kuai wrote:
> > From: Yu Kuai <[email protected]>
> >
> > Changes in v2:
> > - remove the patch to rename raid1-10.c
> >
> > This patchset tries to limit the number of plugged bio for raid1 and
> > raid10, which is done in the last patch, other patches are some refactor
> > and optimizations.
> >
> > This patchset is tested with a new test [1], this test triggers dirty
> > pages writeback for 10s, and in the meantime checks disk inflight.
> >
> > Before this patchset, test will fail because inflight exceed threshold
> > (threshold is set to 4096 in the test, in theory this can be mutch
> > greater as long as there are enough dirty pages and memory).
> >
> > After this patchset, inflight is within 96 (MAX_PLUG_BIO * copies).
> >
> > [1] https://lore.kernel.org/linux-raid/[email protected]/
>
> Friendly ping...

I am sorry for the delay.

The set looks good overall, but I will need some more time to take a closer
look. A few comments/questions:

1. For functions in raid1-10.c, let's prefix them with raid1_ instead of md_*.
2. Do we need unplug_wq to be per-bitmap? Would a shared queue work?

Thanks,
Song

2023-05-13 02:12:34

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 0/7] limit the number of plugged bio

Hi,

On 2023/05/13 8:50, Song Liu wrote:
> On Fri, May 12, 2023 at 2:43 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
> On 2023/04/26 16:20, Yu Kuai wrote:
>>> From: Yu Kuai <[email protected]>
>>>
>>> Changes in v2:
>>> - remove the patch to rename raid1-10.c
>>>
>>> This patchset tries to limit the number of plugged bio for raid1 and
>>> raid10, which is done in the last patch, other patches are some refactor
>>> and optimizations.
>>>
>>> This patchset is tested with a new test [1], this test triggers dirty
>>> pages writeback for 10s, and in the meantime checks disk inflight.
>>>
>>> Before this patchset, test will fail because inflight exceed threshold
>>> (threshold is set to 4096 in the test, in theory this can be mutch
>>> greater as long as there are enough dirty pages and memory).
>>>
>>> After this patchset, inflight is within 96 (MAX_PLUG_BIO * copies).
>>>
>>> [1] https://lore.kernel.org/linux-raid/[email protected]/
>>
>> Friendly ping...
>
> I am sorry for the delay.
>
> The set looks good overall, but I will need some more time to take a closer
> look. A few comments/questions:
>
> 1. For functions in raid1-10.c, let's prefix them with raid1_ instead of md_*.
Ok, I'll change that in v3.

> 2. Do we need unplug_wq to be per-bitmap? Would a shared queue work?

I think this can work; the limitation is that the global workqueue can
support 256 queued works at a time, but this should be enough.
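
As a rough sketch of that alternative (hypothetical and untested; the name
md_bitmap_wq and the init helper are made up here, only to illustrate a
module-level shared workqueue replacing the per-bitmap 'unplug_wq'):

static struct workqueue_struct *md_bitmap_wq;

static int __init md_bitmap_init_wq(void)
{
	/* WQ_MEM_RECLAIM because the unplug work sits in the write-out path */
	md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM, 0);
	return md_bitmap_wq ? 0 : -ENOMEM;
}

void md_bitmap_unplug_async(struct bitmap *bitmap)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct bitmap_unplug_work unplug_work;

	INIT_WORK(&unplug_work.work, md_bitmap_unplug_fn);
	unplug_work.bitmap = bitmap;
	unplug_work.done = &done;

	/* queue on the shared workqueue instead of bitmap->unplug_wq */
	queue_work(md_bitmap_wq, &unplug_work.work);
	wait_for_completion(&done);
}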

Thanks,
Kuai
>
> Thanks,
> Song
> .
>


2023-05-29 02:24:11

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

Hi Kuai

There is a limitation on dirty memory in your test. But for most
situations, customers will not set this. Can this change introduce a
performance regression in other situations?

Best Regards
Xiao

On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> bio can be added to plug infinitely, and following writeback test can
> trigger huge amount of plugged bio:
>
> Test script:
> modprobe brd rd_nr=4 rd_size=10485760
> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> echo 0 > /proc/sys/vm/dirty_background_ratio
> echo 60 > /proc/sys/vm/dirty_ratio
> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
>
> Test result:
> Monitor /sys/block/md0/inflight will found that inflight keep increasing
> until fio finish writing, after running for about 2 minutes:
>
> [root@fedora ~]# cat /sys/block/md0/inflight
> 0 4474191
>
> Fix the problem by limiting the number of plugged bio based on the number
> of copies for original bio.
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid1-10.c | 9 ++++++++-
> drivers/md/raid1.c | 2 +-
> drivers/md/raid10.c | 2 +-
> 3 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> index 98d678b7df3f..35fb80aa37aa 100644
> --- a/drivers/md/raid1-10.c
> +++ b/drivers/md/raid1-10.c
> @@ -21,6 +21,7 @@
> #define IO_MADE_GOOD ((struct bio *)2)
>
> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
> +#define MAX_PLUG_BIO 32
>
> /* for managing resync I/O pages */
> struct resync_pages {
> @@ -31,6 +32,7 @@ struct resync_pages {
> struct raid1_plug_cb {
> struct blk_plug_cb cb;
> struct bio_list pending;
> + unsigned int count;
> };
>
> static void rbio_pool_free(void *rbio, void *data)
> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
> }
>
> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> - blk_plug_cb_fn unplug)
> + blk_plug_cb_fn unplug, int copies)
> {
> struct raid1_plug_cb *plug = NULL;
> struct blk_plug_cb *cb;
> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>
> plug = container_of(cb, struct raid1_plug_cb, cb);
> bio_list_add(&plug->pending, bio);
> + if (++plug->count / MAX_PLUG_BIO >= copies) {
> + list_del(&cb->list);
> + cb->callback(cb, false);
> + }
> +
>
> return true;
> }
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 639e09cecf01..c6066408a913 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> r1_bio->sector);
> /* flush_pending_writes() needs access to the rdev so...*/
> mbio->bi_bdev = (void *)rdev;
> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
> spin_lock_irqsave(&conf->device_lock, flags);
> bio_list_add(&conf->pending_bio_list, mbio);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index bd9e655ca408..7135cfaf75db 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
>
> atomic_inc(&r10_bio->remaining);
>
> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
> spin_lock_irqsave(&conf->device_lock, flags);
> bio_list_add(&conf->pending_bio_list, mbio);
> spin_unlock_irqrestore(&conf->device_lock, flags);
> --
> 2.39.2
>


2023-05-29 02:26:19

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

Hi,

On 2023/05/29 10:08, Xiao Ni wrote:
> Hi Kuai
>
> There is a limitation of the memory in your test. But for most
> situations, customers should not set this. Can this change introduce a
> performance regression against other situations?

Note that this limitation is just to trigger writeback as soon as
possible in the test, and it's 100% sure that real situations can trigger
dirty page writeback asynchronously and continue to produce new dirty
pages.

If a lot of bios are not plugged, then it's the same as before; if a lot
of bios are plugged, note that before this patchset these bios would spend
quite a long time in the plug, and hence I think performance should be
better.

Thanks,
Kuai
>
> Best Regards
> Xiao
>
> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> bio can be added to plug infinitely, and following writeback test can
>> trigger huge amount of plugged bio:
>>
>> Test script:
>> modprobe brd rd_nr=4 rd_size=10485760
>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
>> echo 0 > /proc/sys/vm/dirty_background_ratio
>> echo 60 > /proc/sys/vm/dirty_ratio
>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
>>
>> Test result:
>> Monitor /sys/block/md0/inflight will found that inflight keep increasing
>> until fio finish writing, after running for about 2 minutes:
>>
>> [root@fedora ~]# cat /sys/block/md0/inflight
>> 0 4474191
>>
>> Fix the problem by limiting the number of plugged bio based on the number
>> of copies for original bio.
>>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/raid1-10.c | 9 ++++++++-
>> drivers/md/raid1.c | 2 +-
>> drivers/md/raid10.c | 2 +-
>> 3 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
>> index 98d678b7df3f..35fb80aa37aa 100644
>> --- a/drivers/md/raid1-10.c
>> +++ b/drivers/md/raid1-10.c
>> @@ -21,6 +21,7 @@
>> #define IO_MADE_GOOD ((struct bio *)2)
>>
>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
>> +#define MAX_PLUG_BIO 32
>>
>> /* for managing resync I/O pages */
>> struct resync_pages {
>> @@ -31,6 +32,7 @@ struct resync_pages {
>> struct raid1_plug_cb {
>> struct blk_plug_cb cb;
>> struct bio_list pending;
>> + unsigned int count;
>> };
>>
>> static void rbio_pool_free(void *rbio, void *data)
>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
>> }
>>
>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>> - blk_plug_cb_fn unplug)
>> + blk_plug_cb_fn unplug, int copies)
>> {
>> struct raid1_plug_cb *plug = NULL;
>> struct blk_plug_cb *cb;
>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>
>> plug = container_of(cb, struct raid1_plug_cb, cb);
>> bio_list_add(&plug->pending, bio);
>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
>> + list_del(&cb->list);
>> + cb->callback(cb, false);
>> + }
>> +
>>
>> return true;
>> }
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 639e09cecf01..c6066408a913 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> r1_bio->sector);
>> /* flush_pending_writes() needs access to the rdev so...*/
>> mbio->bi_bdev = (void *)rdev;
>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
>> spin_lock_irqsave(&conf->device_lock, flags);
>> bio_list_add(&conf->pending_bio_list, mbio);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index bd9e655ca408..7135cfaf75db 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
>>
>> atomic_inc(&r10_bio->remaining);
>>
>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
>> spin_lock_irqsave(&conf->device_lock, flags);
>> bio_list_add(&conf->pending_bio_list, mbio);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> --
>> 2.39.2
>>
>
> .
>


2023-05-29 03:21:43

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/05/29 10:08, Xiao Ni wrote:
> > Hi Kuai
> >
> > There is a limitation of the memory in your test. But for most
> > situations, customers should not set this. Can this change introduce a
> > performance regression against other situations?
>
> Noted that this limitation is just to triggered writeback as soon as
> possible in the test, and it's 100% sure real situations can trigger
> dirty pages write back asynchronously and continue to produce new dirty
> pages.

Hi

I'm confused here. If we want to trigger writeback quickly, we need to
set these two values to smaller numbers, rather than 0 and 60.
Right?
>
> If a lot of bio is not plugged, then it's the same as before; if a lot
> of bio is plugged, noted that before this patchset, these bio will spent
> quite a long time in plug, and hence I think performance should be
> better.

Hmm, doesn't it depend on whether it's sequential or not? If it's a big io
request, can it miss the merge opportunity?

Regards
Xiao

>
> Thanks,
> Kuai
> >
> > Best Regards
> > Xiao
> >
> > On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
> >>
> >> From: Yu Kuai <[email protected]>
> >>
> >> bio can be added to plug infinitely, and following writeback test can
> >> trigger huge amount of plugged bio:
> >>
> >> Test script:
> >> modprobe brd rd_nr=4 rd_size=10485760
> >> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> >> echo 0 > /proc/sys/vm/dirty_background_ratio
> >> echo 60 > /proc/sys/vm/dirty_ratio
> >> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
> >>
> >> Test result:
> >> Monitor /sys/block/md0/inflight will found that inflight keep increasing
> >> until fio finish writing, after running for about 2 minutes:
> >>
> >> [root@fedora ~]# cat /sys/block/md0/inflight
> >> 0 4474191
> >>
> >> Fix the problem by limiting the number of plugged bio based on the number
> >> of copies for original bio.
> >>
> >> Signed-off-by: Yu Kuai <[email protected]>
> >> ---
> >> drivers/md/raid1-10.c | 9 ++++++++-
> >> drivers/md/raid1.c | 2 +-
> >> drivers/md/raid10.c | 2 +-
> >> 3 files changed, 10 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> >> index 98d678b7df3f..35fb80aa37aa 100644
> >> --- a/drivers/md/raid1-10.c
> >> +++ b/drivers/md/raid1-10.c
> >> @@ -21,6 +21,7 @@
> >> #define IO_MADE_GOOD ((struct bio *)2)
> >>
> >> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
> >> +#define MAX_PLUG_BIO 32
> >>
> >> /* for managing resync I/O pages */
> >> struct resync_pages {
> >> @@ -31,6 +32,7 @@ struct resync_pages {
> >> struct raid1_plug_cb {
> >> struct blk_plug_cb cb;
> >> struct bio_list pending;
> >> + unsigned int count;
> >> };
> >>
> >> static void rbio_pool_free(void *rbio, void *data)
> >> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
> >> }
> >>
> >> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >> - blk_plug_cb_fn unplug)
> >> + blk_plug_cb_fn unplug, int copies)
> >> {
> >> struct raid1_plug_cb *plug = NULL;
> >> struct blk_plug_cb *cb;
> >> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >>
> >> plug = container_of(cb, struct raid1_plug_cb, cb);
> >> bio_list_add(&plug->pending, bio);
> >> + if (++plug->count / MAX_PLUG_BIO >= copies) {
> >> + list_del(&cb->list);
> >> + cb->callback(cb, false);
> >> + }
> >> +
> >>
> >> return true;
> >> }
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index 639e09cecf01..c6066408a913 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> >> r1_bio->sector);
> >> /* flush_pending_writes() needs access to the rdev so...*/
> >> mbio->bi_bdev = (void *)rdev;
> >> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
> >> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
> >> spin_lock_irqsave(&conf->device_lock, flags);
> >> bio_list_add(&conf->pending_bio_list, mbio);
> >> spin_unlock_irqrestore(&conf->device_lock, flags);
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index bd9e655ca408..7135cfaf75db 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
> >>
> >> atomic_inc(&r10_bio->remaining);
> >>
> >> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
> >> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
> >> spin_lock_irqsave(&conf->device_lock, flags);
> >> bio_list_add(&conf->pending_bio_list, mbio);
> >> spin_unlock_irqrestore(&conf->device_lock, flags);
> >> --
> >> 2.39.2
> >>
> >
> > .
> >
>


2023-05-29 03:33:23

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

Hi,

On 2023/05/29 11:10, Xiao Ni wrote:
> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/05/29 10:08, Xiao Ni wrote:
>>> Hi Kuai
>>>
>>> There is a limitation of the memory in your test. But for most
>>> situations, customers should not set this. Can this change introduce a
>>> performance regression against other situations?
>>
>> Noted that this limitation is just to triggered writeback as soon as
>> possible in the test, and it's 100% sure real situations can trigger
>> dirty pages write back asynchronously and continue to produce new dirty
>> pages.
>
> Hi
>
> I'm confused here. If we want to trigger write back quickly, it needs
> to set these two values with a smaller number, rather than 0 and 60.
> Right?

60 is not required, I'll remove this setting.

0 just means write back if there are any dirty pages.
>>
>> If a lot of bio is not plugged, then it's the same as before; if a lot
>> of bio is plugged, noted that before this patchset, these bio will spent
>> quite a long time in plug, and hence I think performance should be
>> better.
>
> Hmm, it depends on if it's sequential or not? If it's a big io
> request, can it miss the merge opportunity?

The bio will still be merged into the underlying disks' rq (if it's rq
based); the underlying disk won't flush the plug until the number of
requests exceeds the threshold.

Thanks,
Kuai
>
> Regards
> Xiao
>
>>
>> Thanks,
>> Kuai
>>>
>>> Best Regards
>>> Xiao
>>>
>>> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> bio can be added to plug infinitely, and following writeback test can
>>>> trigger huge amount of plugged bio:
>>>>
>>>> Test script:
>>>> modprobe brd rd_nr=4 rd_size=10485760
>>>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
>>>> echo 0 > /proc/sys/vm/dirty_background_ratio
>>>> echo 60 > /proc/sys/vm/dirty_ratio
>>>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
>>>>
>>>> Test result:
>>>> Monitor /sys/block/md0/inflight will found that inflight keep increasing
>>>> until fio finish writing, after running for about 2 minutes:
>>>>
>>>> [root@fedora ~]# cat /sys/block/md0/inflight
>>>> 0 4474191
>>>>
>>>> Fix the problem by limiting the number of plugged bio based on the number
>>>> of copies for original bio.
>>>>
>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>> ---
>>>> drivers/md/raid1-10.c | 9 ++++++++-
>>>> drivers/md/raid1.c | 2 +-
>>>> drivers/md/raid10.c | 2 +-
>>>> 3 files changed, 10 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
>>>> index 98d678b7df3f..35fb80aa37aa 100644
>>>> --- a/drivers/md/raid1-10.c
>>>> +++ b/drivers/md/raid1-10.c
>>>> @@ -21,6 +21,7 @@
>>>> #define IO_MADE_GOOD ((struct bio *)2)
>>>>
>>>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
>>>> +#define MAX_PLUG_BIO 32
>>>>
>>>> /* for managing resync I/O pages */
>>>> struct resync_pages {
>>>> @@ -31,6 +32,7 @@ struct resync_pages {
>>>> struct raid1_plug_cb {
>>>> struct blk_plug_cb cb;
>>>> struct bio_list pending;
>>>> + unsigned int count;
>>>> };
>>>>
>>>> static void rbio_pool_free(void *rbio, void *data)
>>>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
>>>> }
>>>>
>>>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>> - blk_plug_cb_fn unplug)
>>>> + blk_plug_cb_fn unplug, int copies)
>>>> {
>>>> struct raid1_plug_cb *plug = NULL;
>>>> struct blk_plug_cb *cb;
>>>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>>
>>>> plug = container_of(cb, struct raid1_plug_cb, cb);
>>>> bio_list_add(&plug->pending, bio);
>>>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
>>>> + list_del(&cb->list);
>>>> + cb->callback(cb, false);
>>>> + }
>>>> +
>>>>
>>>> return true;
>>>> }
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 639e09cecf01..c6066408a913 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>> r1_bio->sector);
>>>> /* flush_pending_writes() needs access to the rdev so...*/
>>>> mbio->bi_bdev = (void *)rdev;
>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>> index bd9e655ca408..7135cfaf75db 100644
>>>> --- a/drivers/md/raid10.c
>>>> +++ b/drivers/md/raid10.c
>>>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
>>>>
>>>> atomic_inc(&r10_bio->remaining);
>>>>
>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> --
>>>> 2.39.2
>>>>
>>>
>>> .
>>>
>>
>
> .
>


2023-05-29 08:04:26

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

On Mon, May 29, 2023 at 11:18 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/05/29 11:10, Xiao Ni wrote:
> > On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/05/29 10:08, Xiao Ni wrote:
> >>> Hi Kuai
> >>>
> >>> There is a limitation of the memory in your test. But for most
> >>> situations, customers should not set this. Can this change introduce a
> >>> performance regression against other situations?
> >>
> >> Noted that this limitation is just to triggered writeback as soon as
> >> possible in the test, and it's 100% sure real situations can trigger
> >> dirty pages write back asynchronously and continue to produce new dirty
> >> pages.
> >
> > Hi
> >
> > I'm confused here. If we want to trigger write back quickly, it needs
> > to set these two values with a smaller number, rather than 0 and 60.
> > Right?
>
> 60 is not required, I'll remove this setting.
>
> 0 just means write back if there are any dirty pages.

Hi Kuai

Does 0 mean disabling writeback? I tried to find the doc that
describes the meaning of setting dirty_background_ratio to 0, but I
didn't find it.
In https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html it
doesn't describe this. But it says something like this

Note:
dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.

Maybe you can set dirty_background_ratio to 1 if you want to
trigger writeback ASAP.

> >>
> >> If a lot of bio is not plugged, then it's the same as before; if a lot
> >> of bio is plugged, noted that before this patchset, these bio will spent
> >> quite a long time in plug, and hence I think performance should be
> >> better.
> >
> > Hmm, it depends on if it's sequential or not? If it's a big io
> > request, can it miss the merge opportunity?
>
> The bio will still be merged to underlying disks' rq(if it's rq based),
> underlying disk won't flush plug untill the number of request exceed
> threshold.

Thanks for this.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Best Regards
> >>> Xiao
> >>>
> >>> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> From: Yu Kuai <[email protected]>
> >>>>
> >>>> bio can be added to plug infinitely, and following writeback test can
> >>>> trigger huge amount of plugged bio:
> >>>>
> >>>> Test script:
> >>>> modprobe brd rd_nr=4 rd_size=10485760
> >>>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> >>>> echo 0 > /proc/sys/vm/dirty_background_ratio
> >>>> echo 60 > /proc/sys/vm/dirty_ratio
> >>>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
> >>>>
> >>>> Test result:
> >>>> Monitor /sys/block/md0/inflight will found that inflight keep increasing
> >>>> until fio finish writing, after running for about 2 minutes:
> >>>>
> >>>> [root@fedora ~]# cat /sys/block/md0/inflight
> >>>> 0 4474191
> >>>>
> >>>> Fix the problem by limiting the number of plugged bio based on the number
> >>>> of copies for original bio.
> >>>>
> >>>> Signed-off-by: Yu Kuai <[email protected]>
> >>>> ---
> >>>> drivers/md/raid1-10.c | 9 ++++++++-
> >>>> drivers/md/raid1.c | 2 +-
> >>>> drivers/md/raid10.c | 2 +-
> >>>> 3 files changed, 10 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> >>>> index 98d678b7df3f..35fb80aa37aa 100644
> >>>> --- a/drivers/md/raid1-10.c
> >>>> +++ b/drivers/md/raid1-10.c
> >>>> @@ -21,6 +21,7 @@
> >>>> #define IO_MADE_GOOD ((struct bio *)2)
> >>>>
> >>>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
> >>>> +#define MAX_PLUG_BIO 32
> >>>>
> >>>> /* for managing resync I/O pages */
> >>>> struct resync_pages {
> >>>> @@ -31,6 +32,7 @@ struct resync_pages {
> >>>> struct raid1_plug_cb {
> >>>> struct blk_plug_cb cb;
> >>>> struct bio_list pending;
> >>>> + unsigned int count;
> >>>> };
> >>>>
> >>>> static void rbio_pool_free(void *rbio, void *data)
> >>>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
> >>>> }
> >>>>
> >>>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >>>> - blk_plug_cb_fn unplug)
> >>>> + blk_plug_cb_fn unplug, int copies)
> >>>> {
> >>>> struct raid1_plug_cb *plug = NULL;
> >>>> struct blk_plug_cb *cb;
> >>>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >>>>
> >>>> plug = container_of(cb, struct raid1_plug_cb, cb);
> >>>> bio_list_add(&plug->pending, bio);
> >>>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
> >>>> + list_del(&cb->list);
> >>>> + cb->callback(cb, false);
> >>>> + }
> >>>> +
> >>>>
> >>>> return true;
> >>>> }
> >>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>> index 639e09cecf01..c6066408a913 100644
> >>>> --- a/drivers/md/raid1.c
> >>>> +++ b/drivers/md/raid1.c
> >>>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> >>>> r1_bio->sector);
> >>>> /* flush_pending_writes() needs access to the rdev so...*/
> >>>> mbio->bi_bdev = (void *)rdev;
> >>>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
> >>>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
> >>>> spin_lock_irqsave(&conf->device_lock, flags);
> >>>> bio_list_add(&conf->pending_bio_list, mbio);
> >>>> spin_unlock_irqrestore(&conf->device_lock, flags);
> >>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >>>> index bd9e655ca408..7135cfaf75db 100644
> >>>> --- a/drivers/md/raid10.c
> >>>> +++ b/drivers/md/raid10.c
> >>>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
> >>>>
> >>>> atomic_inc(&r10_bio->remaining);
> >>>>
> >>>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
> >>>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
> >>>> spin_lock_irqsave(&conf->device_lock, flags);
> >>>> bio_list_add(&conf->pending_bio_list, mbio);
> >>>> spin_unlock_irqrestore(&conf->device_lock, flags);
> >>>> --
> >>>> 2.39.2
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>


2023-05-29 09:24:00

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

Hi,

On 2023/05/29 15:57, Xiao Ni wrote:
> On Mon, May 29, 2023 at 11:18 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/05/29 11:10, Xiao Ni wrote:
>>> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2023/05/29 10:08, Xiao Ni wrote:
>>>>> Hi Kuai
>>>>>
>>>>> There is a limitation of the memory in your test. But for most
>>>>> situations, customers should not set this. Can this change introduce a
>>>>> performance regression against other situations?
>>>>
>>>> Noted that this limitation is just to triggered writeback as soon as
>>>> possible in the test, and it's 100% sure real situations can trigger
>>>> dirty pages write back asynchronously and continue to produce new dirty
>>>> pages.
>>>
>>> Hi
>>>
>>> I'm confused here. If we want to trigger write back quickly, it needs
>>> to set these two values with a smaller number, rather than 0 and 60.
>>> Right?
>>
>> 60 is not required, I'll remove this setting.
>>
>> 0 just means write back if there are any dirty pages.
>
> Hi Kuai
>
> Does 0 mean disabling write back? I tried to find the doc that
> describes the meaning when setting dirty_background_ratio to 0, but I
> didn't find it.
> In https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html it
> doesn't describe this. But it says something like this
>
> Note:
> dirty_background_bytes is the counterpart of dirty_background_ratio. Only
> one of them may be specified at a time. When one sysctl is written it is
> immediately taken into account to evaluate the dirty memory limits and the
> other appears as 0 when read.
>
> Maybe you can specify dirty_background_ratio to 1 if you want to
> trigger write back ASAP.

The purpose here is to trigger writeback ASAP. I'm not an expert here,
but based on the test result, 0 obviously doesn't mean disabling writeback.

Setting dirty_background_bytes to a value will set dirty_background_ratio
to 0 as well, which means dirty_background_ratio is disabled. However,
changing dirty_background_ratio from its default value to 0 ends up with
both dirty_background_ratio and dirty_background_bytes being 0, and based
on the following related code, I think 0 just means write back if there
are any dirty pages.

domain_dirty_limits():
	bg_bytes = dirty_background_bytes			-> 0
	bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100	-> 0

	if (bg_bytes)
		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
	else
		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;	-> 0

	dtc->bg_thresh = bg_thresh;				-> 0

balance_dirty_pages():
	nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
	if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
	    !writeback_in_progress(wb))
		wb_start_background_writeback(wb);		-> writeback ASAP
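
To make that concrete, here is a minimal userspace-style sketch of the
same decision (illustration only, not the actual kernel code; PAGE_SIZE,
the helper name and the sample numbers are assumptions):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed page size, for the sketch only */

/*
 * Mirrors the snippet above: bg_ratio is already scaled as
 * (dirty_background_ratio * PAGE_SIZE) / 100, available_memory and
 * nr_dirty are in pages, bg_bytes is in bytes.
 */
static bool background_writeback_needed(unsigned long bg_bytes,
					unsigned long bg_ratio,
					unsigned long available_memory,
					unsigned long nr_dirty)
{
	unsigned long bg_thresh;

	if (bg_bytes)
		bg_thresh = (bg_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
	else
		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;

	/* both sysctls at 0 -> bg_thresh == 0, one dirty page exceeds it */
	return nr_dirty > bg_thresh;
}

int main(void)
{
	/* ratio = 0, bytes = 0, a single dirty page: writeback starts */
	printf("%d\n", background_writeback_needed(0, 0, 1UL << 20, 1));
	return 0;
}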

Thanks,
Kuai
>
>>>>
>>>> If a lot of bio is not plugged, then it's the same as before; if a lot
>>>> of bio is plugged, noted that before this patchset, these bio will spent
>>>> quite a long time in plug, and hence I think performance should be
>>>> better.
>>>
>>> Hmm, it depends on if it's sequential or not? If it's a big io
>>> request, can it miss the merge opportunity?
>>
>> The bio will still be merged to underlying disks' rq(if it's rq based),
>> underlying disk won't flush plug untill the number of request exceed
>> threshold.
>
> Thanks for this.
>
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>> Regards
>>> Xiao
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>>
>>>>> Best Regards
>>>>> Xiao
>>>>>
>>>>> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
>>>>>>
>>>>>> From: Yu Kuai <[email protected]>
>>>>>>
>>>>>> bio can be added to plug infinitely, and following writeback test can
>>>>>> trigger huge amount of plugged bio:
>>>>>>
>>>>>> Test script:
>>>>>> modprobe brd rd_nr=4 rd_size=10485760
>>>>>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
>>>>>> echo 0 > /proc/sys/vm/dirty_background_ratio
>>>>>> echo 60 > /proc/sys/vm/dirty_ratio
>>>>>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
>>>>>>
>>>>>> Test result:
>>>>>> Monitor /sys/block/md0/inflight will found that inflight keep increasing
>>>>>> until fio finish writing, after running for about 2 minutes:
>>>>>>
>>>>>> [root@fedora ~]# cat /sys/block/md0/inflight
>>>>>> 0 4474191
>>>>>>
>>>>>> Fix the problem by limiting the number of plugged bio based on the number
>>>>>> of copies for original bio.
>>>>>>
>>>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>>>> ---
>>>>>> drivers/md/raid1-10.c | 9 ++++++++-
>>>>>> drivers/md/raid1.c | 2 +-
>>>>>> drivers/md/raid10.c | 2 +-
>>>>>> 3 files changed, 10 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
>>>>>> index 98d678b7df3f..35fb80aa37aa 100644
>>>>>> --- a/drivers/md/raid1-10.c
>>>>>> +++ b/drivers/md/raid1-10.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>> #define IO_MADE_GOOD ((struct bio *)2)
>>>>>>
>>>>>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
>>>>>> +#define MAX_PLUG_BIO 32
>>>>>>
>>>>>> /* for managing resync I/O pages */
>>>>>> struct resync_pages {
>>>>>> @@ -31,6 +32,7 @@ struct resync_pages {
>>>>>> struct raid1_plug_cb {
>>>>>> struct blk_plug_cb cb;
>>>>>> struct bio_list pending;
>>>>>> + unsigned int count;
>>>>>> };
>>>>>>
>>>>>> static void rbio_pool_free(void *rbio, void *data)
>>>>>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
>>>>>> }
>>>>>>
>>>>>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>>>> - blk_plug_cb_fn unplug)
>>>>>> + blk_plug_cb_fn unplug, int copies)
>>>>>> {
>>>>>> struct raid1_plug_cb *plug = NULL;
>>>>>> struct blk_plug_cb *cb;
>>>>>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>>>>
>>>>>> plug = container_of(cb, struct raid1_plug_cb, cb);
>>>>>> bio_list_add(&plug->pending, bio);
>>>>>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
>>>>>> + list_del(&cb->list);
>>>>>> + cb->callback(cb, false);
>>>>>> + }
>>>>>> +
>>>>>>
>>>>>> return true;
>>>>>> }
>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>> index 639e09cecf01..c6066408a913 100644
>>>>>> --- a/drivers/md/raid1.c
>>>>>> +++ b/drivers/md/raid1.c
>>>>>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>>>> r1_bio->sector);
>>>>>> /* flush_pending_writes() needs access to the rdev so...*/
>>>>>> mbio->bi_bdev = (void *)rdev;
>>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
>>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
>>>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>> index bd9e655ca408..7135cfaf75db 100644
>>>>>> --- a/drivers/md/raid10.c
>>>>>> +++ b/drivers/md/raid10.c
>>>>>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
>>>>>>
>>>>>> atomic_inc(&r10_bio->remaining);
>>>>>>
>>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
>>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
>>>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> --
>>>>>> 2.39.2
>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>>
>>> .
>>>
>>
>
> .
>


2023-05-30 01:25:09

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

On Mon, May 29, 2023 at 4:50 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/05/29 15:57, Xiao Ni wrote:
> > On Mon, May 29, 2023 at 11:18 AM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/05/29 11:10, Xiao Ni wrote:
> >>> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2023/05/29 10:08, Xiao Ni wrote:
> >>>>> Hi Kuai
> >>>>>
> >>>>> There is a limitation of the memory in your test. But for most
> >>>>> situations, customers should not set this. Can this change introduce a
> >>>>> performance regression against other situations?
> >>>>
> >>>> Noted that this limitation is just to triggered writeback as soon as
> >>>> possible in the test, and it's 100% sure real situations can trigger
> >>>> dirty pages write back asynchronously and continue to produce new dirty
> >>>> pages.
> >>>
> >>> Hi
> >>>
> >>> I'm confused here. If we want to trigger write back quickly, it needs
> >>> to set these two values with a smaller number, rather than 0 and 60.
> >>> Right?
> >>
> >> 60 is not required, I'll remove this setting.
> >>
> >> 0 just means write back if there are any dirty pages.
> >
> > Hi Kuai
> >
> > Does 0 mean disabling write back? I tried to find the doc that
> > describes the meaning when setting dirty_background_ratio to 0, but I
> > didn't find it.
> > In https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html it
> > doesn't describe this. But it says something like this
> >
> > Note:
> > dirty_background_bytes is the counterpart of dirty_background_ratio. Only
> > one of them may be specified at a time. When one sysctl is written it is
> > immediately taken into account to evaluate the dirty memory limits and the
> > other appears as 0 when read.
> >
> > Maybe you can specify dirty_background_ratio to 1 if you want to
> > trigger write back ASAP.
>
> The purpose here is to trigger write back ASAP, I'm not an expert here,
> but based on test result, 0 obviously doesn't mean disable write back.
>
> Set dirty_background_bytes to a value, dirty_background_ratio will be
> set to 0 together, which means dirty_background_ratio is disabled.
> However, change dirty_background_ratio from default value to 0, will end
> up both dirty_background_ratio and dirty_background_bytes to be 0, and
> based on following related code, I think 0 just means write back if
> there are any dirty pages.
>
> domain_dirty_limits:
> bg_bytes = dirty_background_bytes -> 0
> bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100 -> 0
>
> if (bg_bytes)
> bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
> else
> bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE; -> 0
>
> dtc->bg_thresh = bg_thresh; -> 0
>
> balance_dirty_pages
> nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
> if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
> !writeback_in_progress(wb))
> wb_start_background_writeback(wb); -> writeback ASAP
>
> Thanks,
> Kuai

Hi Kuai

I'm not an expert on this either. Thanks for all your patches; I can
learn more from them too. But I still have some questions.

I did a test in my environment something like this:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
-numjobs=1 -iodepth=128 --runtime=10 -name=xxx
It causes an OOM and the system hangs.

modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
echo 1 > /proc/sys/vm/dirty_background_ratio (THIS is the only different place)
fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
-numjobs=1 -iodepth=128 --runtime=10 -name=xxx
It finishes successfully. Setting dirty_background_ratio to 1 here
means it flushes ASAP.

So your method seems to work the opposite way from what you designed:
the memory can't be flushed in time, so all the memory is used up very
soon, it runs out, and the system hangs. The reason I'm looking at the
test is to figure out whether we really need this change, because in
the real world most customers don't disable writeback. Anyway, it
depends on Song's decision, and thanks for your patches again. I'll
review v3 and try to do some performance tests.

Best Regards
Xiao
> >
> >>>>
> >>>> If a lot of bio is not plugged, then it's the same as before; if a lot
> >>>> of bio is plugged, noted that before this patchset, these bio will spent
> >>>> quite a long time in plug, and hence I think performance should be
> >>>> better.
> >>>
> >>> Hmm, it depends on if it's sequential or not? If it's a big io
> >>> request, can it miss the merge opportunity?
> >>
> >> The bio will still be merged to underlying disks' rq(if it's rq based),
> >> underlying disk won't flush plug untill the number of request exceed
> >> threshold.
> >
> > Thanks for this.
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Regards
> >>> Xiao
> >>>
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>>>
> >>>>> Best Regards
> >>>>> Xiao
> >>>>>
> >>>>> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <[email protected]> wrote:
> >>>>>>
> >>>>>> From: Yu Kuai <[email protected]>
> >>>>>>
> >>>>>> bio can be added to plug infinitely, and following writeback test can
> >>>>>> trigger huge amount of plugged bio:
> >>>>>>
> >>>>>> Test script:
> >>>>>> modprobe brd rd_nr=4 rd_size=10485760
> >>>>>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> >>>>>> echo 0 > /proc/sys/vm/dirty_background_ratio
> >>>>>> echo 60 > /proc/sys/vm/dirty_ratio
> >>>>>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
> >>>>>>
> >>>>>> Test result:
> >>>>>> Monitor /sys/block/md0/inflight will found that inflight keep increasing
> >>>>>> until fio finish writing, after running for about 2 minutes:
> >>>>>>
> >>>>>> [root@fedora ~]# cat /sys/block/md0/inflight
> >>>>>> 0 4474191
> >>>>>>
> >>>>>> Fix the problem by limiting the number of plugged bio based on the number
> >>>>>> of copies for original bio.
> >>>>>>
> >>>>>> Signed-off-by: Yu Kuai <[email protected]>
> >>>>>> ---
> >>>>>> drivers/md/raid1-10.c | 9 ++++++++-
> >>>>>> drivers/md/raid1.c | 2 +-
> >>>>>> drivers/md/raid10.c | 2 +-
> >>>>>> 3 files changed, 10 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
> >>>>>> index 98d678b7df3f..35fb80aa37aa 100644
> >>>>>> --- a/drivers/md/raid1-10.c
> >>>>>> +++ b/drivers/md/raid1-10.c
> >>>>>> @@ -21,6 +21,7 @@
> >>>>>> #define IO_MADE_GOOD ((struct bio *)2)
> >>>>>>
> >>>>>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
> >>>>>> +#define MAX_PLUG_BIO 32
> >>>>>>
> >>>>>> /* for managing resync I/O pages */
> >>>>>> struct resync_pages {
> >>>>>> @@ -31,6 +32,7 @@ struct resync_pages {
> >>>>>> struct raid1_plug_cb {
> >>>>>> struct blk_plug_cb cb;
> >>>>>> struct bio_list pending;
> >>>>>> + unsigned int count;
> >>>>>> };
> >>>>>>
> >>>>>> static void rbio_pool_free(void *rbio, void *data)
> >>>>>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
> >>>>>> }
> >>>>>>
> >>>>>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >>>>>> - blk_plug_cb_fn unplug)
> >>>>>> + blk_plug_cb_fn unplug, int copies)
> >>>>>> {
> >>>>>> struct raid1_plug_cb *plug = NULL;
> >>>>>> struct blk_plug_cb *cb;
> >>>>>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
> >>>>>>
> >>>>>> plug = container_of(cb, struct raid1_plug_cb, cb);
> >>>>>> bio_list_add(&plug->pending, bio);
> >>>>>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
> >>>>>> + list_del(&cb->list);
> >>>>>> + cb->callback(cb, false);
> >>>>>> + }
> >>>>>> +
> >>>>>>
> >>>>>> return true;
> >>>>>> }
> >>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >>>>>> index 639e09cecf01..c6066408a913 100644
> >>>>>> --- a/drivers/md/raid1.c
> >>>>>> +++ b/drivers/md/raid1.c
> >>>>>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> >>>>>> r1_bio->sector);
> >>>>>> /* flush_pending_writes() needs access to the rdev so...*/
> >>>>>> mbio->bi_bdev = (void *)rdev;
> >>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
> >>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
> >>>>>> spin_lock_irqsave(&conf->device_lock, flags);
> >>>>>> bio_list_add(&conf->pending_bio_list, mbio);
> >>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
> >>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >>>>>> index bd9e655ca408..7135cfaf75db 100644
> >>>>>> --- a/drivers/md/raid10.c
> >>>>>> +++ b/drivers/md/raid10.c
> >>>>>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
> >>>>>>
> >>>>>> atomic_inc(&r10_bio->remaining);
> >>>>>>
> >>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
> >>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
> >>>>>> spin_lock_irqsave(&conf->device_lock, flags);
> >>>>>> bio_list_add(&conf->pending_bio_list, mbio);
> >>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
> >>>>>> --
> >>>>>> 2.39.2
> >>>>>>
> >>>>>
> >>>>> .
> >>>>>
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>


2023-05-30 02:10:02

by Yu Kuai

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

Hi,

On 2023/05/30 8:58, Xiao Ni wrote:
> On Mon, May 29, 2023 at 4:50 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/05/29 15:57, Xiao Ni wrote:
>>> On Mon, May 29, 2023 at 11:18 AM Yu Kuai <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 2023/05/29 11:10, Xiao Ni wrote:
>>>>> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 2023/05/29 10:08, Xiao Ni wrote:
>>>>>>> Hi Kuai
>>>>>>>
>>>>>>> There is a limitation of the memory in your test. But for most
>>>>>>> situations, customers should not set this. Can this change introduce a
>>>>>>> performance regression against other situations?
>>>>>>
>>>>>> Noted that this limitation is just to triggered writeback as soon as
>>>>>> possible in the test, and it's 100% sure real situations can trigger
>>>>>> dirty pages write back asynchronously and continue to produce new dirty
>>>>>> pages.
>>>>>
>>>>> Hi
>>>>>
>>>>> I'm confused here. If we want to trigger write back quickly, it needs
>>>>> to set these two values with a smaller number, rather than 0 and 60.
>>>>> Right?
>>>>
>>>> 60 is not required, I'll remove this setting.
>>>>
>>>> 0 just means write back if there are any dirty pages.
>>>
>>> Hi Kuai
>>>
>>> Does 0 mean disabling write back? I tried to find the doc that
>>> describes the meaning when setting dirty_background_ratio to 0, but I
>>> didn't find it.
>>> In https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html it
>>> doesn't describe this. But it says something like this
>>>
>>> Note:
>>> dirty_background_bytes is the counterpart of dirty_background_ratio. Only
>>> one of them may be specified at a time. When one sysctl is written it is
>>> immediately taken into account to evaluate the dirty memory limits and the
>>> other appears as 0 when read.
>>>
>>> Maybe you can specify dirty_background_ratio to 1 if you want to
>>> trigger write back ASAP.
>>
>> The purpose here is to trigger write back ASAP, I'm not an expert here,
>> but based on test result, 0 obviously doesn't mean disable write back.
>>
>> Set dirty_background_bytes to a value, dirty_background_ratio will be
>> set to 0 together, which means dirty_background_ratio is disabled.
>> However, change dirty_background_ratio from default value to 0, will end
>> up both dirty_background_ratio and dirty_background_bytes to be 0, and
>> based on following related code, I think 0 just means write back if
>> there are any dirty pages.
>>
>> domain_dirty_limits:
>> bg_bytes = dirty_background_bytes -> 0
>> bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100 -> 0
>>
>> if (bg_bytes)
>> bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
>> else
>> bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE; -> 0
>>
>> dtc->bg_thresh = bg_thresh; -> 0
>>
>> balance_dirty_pages
>> nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
>> if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
>> !writeback_in_progress(wb))
>> wb_start_background_writeback(wb); -> writeback ASAP
>>
>> Thanks,
>> Kuai
>
> Hi Kuai
>
> I'm not an expert about this either. Thanks for all your patches, I
> can study more things too. But I still have some questions.
>
> I did a test in my environment something like this:
> modprobe brd rd_nr=4 rd_size=10485760
> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> echo 0 > /proc/sys/vm/dirty_background_ratio
> fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
> -numjobs=1 -iodepth=128 --runtime=10 -name=xxx
> It will cause OOM and the system hangs

OOM means you triggered this problem... The plug holds lots of bios
and costs lots of memory; it's not that writeback is disabled. You can
verify this by monitoring md inflight. Note that you shouldn't use too
much memory for the ramdisks (rd_nr * rd_size) in the test, so that OOM
won't be triggered.
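
For example, a trivial monitor like the following (illustration only;
the sysfs path and the one-second interval are just assumptions) can be
left running while fio writes:

#include <stdio.h>
#include <unistd.h>

/*
 * Print md0's inflight counters once per second so the growth of
 * pending writes can be watched while the test is running.
 */
int main(void)
{
	char buf[64];

	for (;;) {
		FILE *f = fopen("/sys/block/md0/inflight", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);
		fclose(f);
		fflush(stdout);
		sleep(1);
	}
}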

Have you tried to test with this patchset?

>
> modprobe brd rd_nr=4 rd_size=10485760
> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> echo 1 > /proc/sys/vm/dirty_background_ratio (THIS is the only different place)
> fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
> -numjobs=1 -iodepth=128 --runtime=10 -name=xxx
> It can finish successfully. The value of dirty_background_ration is 1
> here means it flushes ASAP

This really doesn't mean it flushes ASAP; our test reported this
problem in a real test that doesn't modify dirty_background_ratio. I
guess something triggers io_scheduler(), probably the background thread
thinks the dirty pages don't reach the threshold, but I'm not sure for
now.

Thanks,
Kuai
>
> So your method should be the opposite way as you designed. All the
> memory can't be flushed in time, so it uses all memory very soon and
> the memory runs out and the system hangs. The reason I'm looking at
> the test is that do we really need this change. Because in the real
> world, most customers don't disable write back. Anyway, it depends on
> Song's decision and thanks for your patches again. I'll review V3 and
> try to do some performance tests.
>
> Best Regards
> Xiao


2023-05-30 02:59:55

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio

On Tue, May 30, 2023 at 9:20 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/05/30 8:58, Xiao Ni wrote:
> > On Mon, May 29, 2023 at 4:50 PM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/05/29 15:57, Xiao Ni wrote:
> >>> On Mon, May 29, 2023 at 11:18 AM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 2023/05/29 11:10, Xiao Ni wrote:
> >>>>> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 2023/05/29 10:08, Xiao Ni wrote:
> >>>>>>> Hi Kuai
> >>>>>>>
> >>>>>>> There is a limitation of the memory in your test. But for most
> >>>>>>> situations, customers should not set this. Can this change introduce a
> >>>>>>> performance regression against other situations?
> >>>>>>
> >>>>>> Noted that this limitation is just to triggered writeback as soon as
> >>>>>> possible in the test, and it's 100% sure real situations can trigger
> >>>>>> dirty pages write back asynchronously and continue to produce new dirty
> >>>>>> pages.
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> I'm confused here. If we want to trigger write back quickly, it needs
> >>>>> to set these two values with a smaller number, rather than 0 and 60.
> >>>>> Right?
> >>>>
> >>>> 60 is not required, I'll remove this setting.
> >>>>
> >>>> 0 just means write back if there are any dirty pages.
> >>>
> >>> Hi Kuai
> >>>
> >>> Does 0 mean disabling write back? I tried to find the doc that
> >>> describes the meaning when setting dirty_background_ratio to 0, but I
> >>> didn't find it.
> >>> In https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html it
> >>> doesn't describe this. But it says something like this
> >>>
> >>> Note:
> >>> dirty_background_bytes is the counterpart of dirty_background_ratio. Only
> >>> one of them may be specified at a time. When one sysctl is written it is
> >>> immediately taken into account to evaluate the dirty memory limits and the
> >>> other appears as 0 when read.
> >>>
> >>> Maybe you can specify dirty_background_ratio to 1 if you want to
> >>> trigger write back ASAP.
> >>
> >> The purpose here is to trigger write back ASAP, I'm not an expert here,
> >> but based on test result, 0 obviously doesn't mean disable write back.
> >>
> >> Set dirty_background_bytes to a value, dirty_background_ratio will be
> >> set to 0 together, which means dirty_background_ratio is disabled.
> >> However, change dirty_background_ratio from default value to 0, will end
> >> up both dirty_background_ratio and dirty_background_bytes to be 0, and
> >> based on following related code, I think 0 just means write back if
> >> there are any dirty pages.
> >>
> >> domain_dirty_limits:
> >> bg_bytes = dirty_background_bytes -> 0
> >> bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100 -> 0
> >>
> >> if (bg_bytes)
> >> bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
> >> else
> >> bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE; -> 0
> >>
> >> dtc->bg_thresh = bg_thresh; -> 0
> >>
> >> balance_dirty_pages
> >> nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
> >> if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
> >> !writeback_in_progress(wb))
> >> wb_start_background_writeback(wb); -> writeback ASAP
> >>
> >> Thanks,
> >> Kuai
> >
> > Hi Kuai
> >
> > I'm not an expert about this either. Thanks for all your patches, I
> > can study more things too. But I still have some questions.
> >
> > I did a test in my environment something like this:
> > modprobe brd rd_nr=4 rd_size=10485760
> > mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> > echo 0 > /proc/sys/vm/dirty_background_ratio
> > fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
> > -numjobs=1 -iodepth=128 --runtime=10 -name=xxx
> > It will cause OOM and the system hangs
>
> OOM means you trigger this problem... Plug hold lots of bios and cost
> lots of memory, it's not that write back is disabled, you can verify
> this by monitor md inflight, noted that don't use too much memory for
> ramdisk(rd_nr * rd_size) in the test so that OOM won't be triggered.
>
> Have you tried to test with this patchset?

Yes, I know; I have reproduced this problem. I'll try the v3 patchset.
>
> >
> > modprobe brd rd_nr=4 rd_size=10485760
> > mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
> > echo 1 > /proc/sys/vm/dirty_background_ratio (THIS is the only different place)
> > fio -filename=/dev/md0 -ioengine=libaio -rw=write -thread -bs=1k-8k
> > -numjobs=1 -iodepth=128 --runtime=10 -name=xxx
> > It can finish successfully. The value of dirty_background_ration is 1
> > here means it flushes ASAP
>
> This really doesn't mean flushes ASAP, our test report this problem in
> the real test that doesn't modify dirty_background_ratio. I guess
> somewhere triggers io_scheduler(), probably background thread think
> dirty pages doesn't match threshold, but I'm not sure for now.

Thanks for notifying me of this.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > So your method should be the opposite way as you designed. All the
> > memory can't be flushed in time, so it uses all memory very soon and
> > the memory runs out and the system hangs. The reason I'm looking at
> > the test is that do we really need this change. Because in the real
> > world, most customers don't disable write back. Anyway, it depends on
> > Song's decision and thanks for your patches again. I'll review V3 and
> > try to do some performance tests.
> >
> > Best Regards
> > Xiao
>