LinuxLists.cc - [PATCH v3 0/1] parallel 'copy-from' Ops in copy_file

2020-02-03 18:56:07

Subject: [PATCH v3 0/1] parallel 'copy-from' Ops in copy_file_range

Hi,

Here's another re-spin of the patch to improve copy_file_range performance
by parallelizing the OSD requests.

Changelog since v2:

- Refactor copy loop to ensure 'ret' is updated with the number of bytes
already copied only after the ceph_osdc_wait_requests() returns. This
meant to change this helper function to also return the number of
successful requests. Also fixed error handling.

- Use counter for # copies instead of re-calculating ncopies (now
copy_count) everytime time it reaches zero.

- Dropped 'dout' in ceph_osdc_wait_requests

- Have ceph_osdc_wait_requests and ceph_osdc_wait_request functions close

Changelog since v1 (the RFC):

- Dropped synchronous version of ceph_osdc_copy_from
* This was the reason for merging the patchset into a single patch, as
changing ceph_osdc_copy_from definition would break bisectability

- Moved wait_copy_from_reqs into libceph, renaming it to
ceph_osdc_wait_requests

- Use ->r_private_item instead of adding a new list_head into struct
ceph_osd_request

- Dropped the module parameter (used for testing) and added the throttling
mechanism using the formula suggested by Ilya.

For reference, here's the original RFC cover letter (fixed typo in results
table: 'throttle=5' should be 'throttle=0').

--------------------------------------------------------------------------

As discussed here[1] I'm sending an RFC patchset that does the
parallelization of the requests sent to the OSDs during a copy_file_range
syscall in CephFS.

[1] https://lore.kernel.org/lkml/[email protected]/

I've also some performance numbers that I wanted to share. Here's a
description of the very simple tests I've run:

- create a file with 200 objects in it
* i.e. tests with different object sizes mean different file sizes
- drop all caches and umount the filesystem
- Measure:
* mount filesystem
* full file copy (with copy_file_range)
* umount filesystem

Tests were repeated several times and the average value was used for
comparison.

DISCLAIMER:
These numbers are only indicative, and different clusters and client
configs will for sure show different performance! More rigorous tests
would be require to validate these results.

Having as baseline a full read+write (basically, a copy_file_range
operation within a filesystem mounted without the 'copyfrom' option),
here's some values for different object sizes:

8M 4M 1M 65k
read+write 100% 100% 100% 100%
sequential 51% 52% 83% >100%
parallel (throttle=1) 51% 52% 83% >100%
parallel (throttle=0) 17% 17% 83% >100%

Notes:

- 'parallel (throttle=0)' was a test where *all* the requests (i.e. 200
requests to copy the 200 objects in the file) were sent to the OSDs and
the wait for requests completion is done at the end only.

- 'parallel (throttle=1)' was just a control test, where the wait for
completion is done immediately after a request is sent. It was expected
to be very similar to the non-optimized ('sequential') tests.

- These tests were executed on a cluster with 40 OSDs, spread across 5
(bare-metal) nodes.

- The tests with object size of 65k show that copy_file_range definitely
doesn't scale to files with small object sizes. '> 100%' actually means
more than 10x slower.

Measuring the mount+copy+umount masks the actual difference between
different throttle values due to the time spent in mount+umount. Thus,
there was no real difference between throttle=0 (send all and wait) and
throttle=20 (send 20, wait, send 20, ...). But here's what I observed
when measuring only the copy operation (4M object size):

read+write 100%
parallel (throttle=1) 56%
parallel (throttle=5) 23%
parallel (throttle=10) 14%
parallel (throttle=20) 9%
parallel (throttle=0) 5%

Anyway, I'll still need to revisit patch 0003 as it doesn't follow the
suggestion done by Jeff to *not* add another knob to fine-tune the
throttle value -- this patch adds a kernel parameter for a knob that I
wanted to use in my testing to observe different values of this throttle
limit.

The goal is to probably to drop this patch and do the throttling in patch
0002. I just need to come up with a decent heuristic. Jeff's suggestion
was to use rsize/wsize, which are set to 64M by default IIRC. Somehow I
feel that it should be related to the number of OSDs in the cluster
instead, but I'm not sure how. And testing these sort of heuristics would
require different clusters, which isn't particularly easy to get. Anyway,
comments are welcome!

Cheers,
--
Luis

Luis Henriques (1):
ceph: parallelize all copy-from requests in copy_file_range

fs/ceph/file.c | 45 +++++++++++++++++++++-----
include/linux/ceph/osd_client.h | 6 +++-
net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
3 files changed, 85 insertions(+), 22 deletions(-)

2020-02-03 18:56:42

by Luis Henriques

[permalink] [raw]

Subject: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
operations, waiting for each request to complete before sending the next
one. This patch modifies copy_file_range so that all the 'copy-from'
operations are sent in bulk and waits for its completion at the end. This
will allow significant speed-ups, specially when sending requests for
different target OSDs.

Signed-off-by: Luis Henriques <[email protected]>
---
fs/ceph/file.c | 45 +++++++++++++++++++++-----
include/linux/ceph/osd_client.h | 6 +++-
net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
3 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..b9d8ffafb8c5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
struct ceph_object_locator src_oloc, dst_oloc;
struct ceph_object_id src_oid, dst_oid;
+ struct ceph_osd_request *req;
loff_t endoff = 0, size;
ssize_t ret = -EIO;
u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
u32 src_objlen, dst_objlen, object_size;
int src_got = 0, dst_got = 0, err, dirty;
+ unsigned int max_copies, copy_count, reqs_complete = 0;
bool do_final_copy = false;
+ LIST_HEAD(osd_reqs);

if (src_inode->i_sb != dst_inode->i_sb) {
struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
@@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
goto out_caps;
}
object_size = src_ci->i_layout.object_size;
+
+ /*
+ * Throttle the object copies: max_copies holds the number of allowed
+ * in-flight 'copy-from' requests before waiting for their completion
+ */
+ max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
+ copy_count = max_copies;
while (len >= object_size) {
ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
object_size, &src_objnum,
@@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
ceph_oid_printf(&dst_oid, "%llx.%08llx",
dst_ci->i_vino.ino, dst_objnum);
/* Do an object remote copy */
- err = ceph_osdc_copy_from(
+ req = ceph_osdc_copy_from(
&src_fsc->client->osdc,
src_ci->i_vino.snap, 0,
&src_oid, &src_oloc,
@@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
- if (err) {
- if (err == -EOPNOTSUPP) {
- src_fsc->have_copy_from2 = false;
- pr_notice("OSDs don't support 'copy-from2'; "
- "disabling copy_file_range\n");
- }
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
dout("ceph_osdc_copy_from returned %d\n", err);
+
+ /* wait for all queued requests */
+ ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
+ ret += reqs_complete * object_size; /* Update copied bytes */
if (!ret)
ret = err;
goto out_caps;
}
+ list_add(&req->r_private_item, &osd_reqs);
len -= object_size;
src_off += object_size;
dst_off += object_size;
- ret += object_size;
+ /*
+ * Wait requests if we've reached the maximum requests allowed
+ * or if this was the last copy
+ */
+ if ((--copy_count == 0) || (len < object_size)) {
+ err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
+ ret += reqs_complete * object_size; /* Update copied bytes */
+ if (err) {
+ if (err == -EOPNOTSUPP) {
+ src_fsc->have_copy_from2 = false;
+ pr_notice("OSDs don't support 'copy-from2'; "
+ "disabling copy_file_range\n");
+ }
+ if (!ret)
+ ret = err;
+ goto out_caps;
+ }
+ copy_count = max_copies;
+ }
}

if (len)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 5a62dbd3f4c2..0121767cd65e 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
struct ceph_osd_request *req);
+extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
+ unsigned int *reqs_complete);
+
extern void ceph_osdc_sync(struct ceph_osd_client *osdc);

extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
@@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
struct timespec64 *mtime,
struct page **pages, int nr_pages);

-int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
+struct ceph_osd_request *ceph_osdc_copy_from(
+ struct ceph_osd_client *osdc,
u64 src_snapid, u64 src_version,
struct ceph_object_id *src_oid,
struct ceph_object_locator *src_oloc,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b68b376d8c2f..df9f342f860a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
}
EXPORT_SYMBOL(ceph_osdc_wait_request);

+/*
+ * wait for all requests to complete in list @osd_reqs, returning the number of
+ * successful completions in @reqs_complete
+ */
+int ceph_osdc_wait_requests(struct list_head *osd_reqs,
+ unsigned int *reqs_complete)
+{
+ struct ceph_osd_request *req;
+ int ret = 0, err;
+ unsigned int counter = 0;
+
+ while (!list_empty(osd_reqs)) {
+ req = list_first_entry(osd_reqs,
+ struct ceph_osd_request,
+ r_private_item);
+ list_del_init(&req->r_private_item);
+ err = ceph_osdc_wait_request(req->r_osdc, req);
+ if (!err)
+ counter++;
+ else if (!ret)
+ ret = err;
+ ceph_osdc_put_request(req);
+ }
+ *reqs_complete = counter;
+
+ return ret;
+}
+EXPORT_SYMBOL(ceph_osdc_wait_requests);
+
/*
* sync - wait for all in-flight requests to flush. avoid starvation.
*/
@@ -5346,23 +5375,24 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
return 0;
}

-int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
- u64 src_snapid, u64 src_version,
- struct ceph_object_id *src_oid,
- struct ceph_object_locator *src_oloc,
- u32 src_fadvise_flags,
- struct ceph_object_id *dst_oid,
- struct ceph_object_locator *dst_oloc,
- u32 dst_fadvise_flags,
- u32 truncate_seq, u64 truncate_size,
- u8 copy_from_flags)
+struct ceph_osd_request *ceph_osdc_copy_from(
+ struct ceph_osd_client *osdc,
+ u64 src_snapid, u64 src_version,
+ struct ceph_object_id *src_oid,
+ struct ceph_object_locator *src_oloc,
+ u32 src_fadvise_flags,
+ struct ceph_object_id *dst_oid,
+ struct ceph_object_locator *dst_oloc,
+ u32 dst_fadvise_flags,
+ u32 truncate_seq, u64 truncate_size,
+ u8 copy_from_flags)
{
struct ceph_osd_request *req;
int ret;

req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
if (!req)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);

req->r_flags = CEPH_OSD_FLAG_WRITE;

@@ -5381,11 +5411,11 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
goto out;

ceph_osdc_start_request(osdc, req, false);
- ret = ceph_osdc_wait_request(osdc, req);
+ return req;

out:
ceph_osdc_put_request(req);
- return ret;
+ return ERR_PTR(ret);
}
EXPORT_SYMBOL(ceph_osdc_copy_from);

2020-02-04 10:58:53

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Mon, Feb 3, 2020 at 5:51 PM Luis Henriques <[email protected]> wrote:
>
> Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> operations, waiting for each request to complete before sending the next
> one. This patch modifies copy_file_range so that all the 'copy-from'
> operations are sent in bulk and waits for its completion at the end. This
> will allow significant speed-ups, specially when sending requests for
> different target OSDs.
>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> fs/ceph/file.c | 45 +++++++++++++++++++++-----
> include/linux/ceph/osd_client.h | 6 +++-
> net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> 3 files changed, 85 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> struct ceph_object_locator src_oloc, dst_oloc;
> struct ceph_object_id src_oid, dst_oid;
> + struct ceph_osd_request *req;
> loff_t endoff = 0, size;
> ssize_t ret = -EIO;
> u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> u32 src_objlen, dst_objlen, object_size;
> int src_got = 0, dst_got = 0, err, dirty;
> + unsigned int max_copies, copy_count, reqs_complete = 0;
> bool do_final_copy = false;
> + LIST_HEAD(osd_reqs);
>
> if (src_inode->i_sb != dst_inode->i_sb) {
> struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> goto out_caps;
> }
> object_size = src_ci->i_layout.object_size;
> +
> + /*
> + * Throttle the object copies: max_copies holds the number of allowed
> + * in-flight 'copy-from' requests before waiting for their completion
> + */
> + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
> + copy_count = max_copies;
> while (len >= object_size) {
> ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> object_size, &src_objnum,
> @@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> ceph_oid_printf(&dst_oid, "%llx.%08llx",
> dst_ci->i_vino.ino, dst_objnum);
> /* Do an object remote copy */
> - err = ceph_osdc_copy_from(
> + req = ceph_osdc_copy_from(
> &src_fsc->client->osdc,
> src_ci->i_vino.snap, 0,
> &src_oid, &src_oloc,
> @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> - if (err) {
> - if (err == -EOPNOTSUPP) {
> - src_fsc->have_copy_from2 = false;
> - pr_notice("OSDs don't support 'copy-from2'; "
> - "disabling copy_file_range\n");
> - }
> + if (IS_ERR(req)) {
> + err = PTR_ERR(req);
> dout("ceph_osdc_copy_from returned %d\n", err);
> +
> + /* wait for all queued requests */
> + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> + ret += reqs_complete * object_size; /* Update copied bytes */

Hi Luis,

Looks like ret is still incremented unconditionally? What happens
if there are three OSD requests on the list and the first fails but
the second and the third succeed? As is, ceph_osdc_wait_requests()
will return an error with reqs_complete set to 2...

> if (!ret)
> ret = err;

... and we will return 8M instead of an error.

> goto out_caps;
> }
> + list_add(&req->r_private_item, &osd_reqs);
> len -= object_size;
> src_off += object_size;
> dst_off += object_size;
> - ret += object_size;
> + /*
> + * Wait requests if we've reached the maximum requests allowed
> + * or if this was the last copy
> + */
> + if ((--copy_count == 0) || (len < object_size)) {
> + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> + ret += reqs_complete * object_size; /* Update copied bytes */

Same here.

> + if (err) {
> + if (err == -EOPNOTSUPP) {
> + src_fsc->have_copy_from2 = false;

Since EOPNOTSUPP is special in that it triggers the fallback, it
should be returned even if something was copied. Think about a
mixed cluster, where some OSDs support copy-from2 and some don't.
If the range is split between such OSDs, copy_file_range() will
always return short if the range happens to start on a new OSD.

> + pr_notice("OSDs don't support 'copy-from2'; "
> + "disabling copy_file_range\n");

This line is over 80 characters but shouldn't be split because it
is a grepable log message. Also, this message was slightly tweaked
in ceph-client.git, so this patch doesn't apply.

> + }
> + if (!ret)
> + ret = err;
> + goto out_caps;

I'm confused about out_caps. Since a short copy is still copy which
may change the size, update mtime and ctime times, etc why aren't we
dirtying caps in these cases?

> + }
> + copy_count = max_copies;
> + }
> }
>
> if (len)
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 5a62dbd3f4c2..0121767cd65e 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> struct ceph_osd_request *req);
> +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> + unsigned int *reqs_complete);
> +
> extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
>
> extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> struct timespec64 *mtime,
> struct page **pages, int nr_pages);
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> +struct ceph_osd_request *ceph_osdc_copy_from(
> + struct ceph_osd_client *osdc,
> u64 src_snapid, u64 src_version,
> struct ceph_object_id *src_oid,
> struct ceph_object_locator *src_oloc,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b68b376d8c2f..df9f342f860a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> }
> EXPORT_SYMBOL(ceph_osdc_wait_request);
>
> +/*
> + * wait for all requests to complete in list @osd_reqs, returning the number of
> + * successful completions in @reqs_complete
> + */
> +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> + unsigned int *reqs_complete)
> +{
> + struct ceph_osd_request *req;
> + int ret = 0, err;
> + unsigned int counter = 0;
> +
> + while (!list_empty(osd_reqs)) {
> + req = list_first_entry(osd_reqs,
> + struct ceph_osd_request,
> + r_private_item);
> + list_del_init(&req->r_private_item);
> + err = ceph_osdc_wait_request(req->r_osdc, req);
> + if (!err)
> + counter++;

I think you want to stop incrementing counter after encountering an
error...

Thanks,

Ilya

2020-02-04 13:31:11

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Mon, 2020-02-03 at 16:51 +0000, Luis Henriques wrote:
> Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> operations, waiting for each request to complete before sending the next
> one. This patch modifies copy_file_range so that all the 'copy-from'
> operations are sent in bulk and waits for its completion at the end. This
> will allow significant speed-ups, specially when sending requests for
> different target OSDs.
>

Looks good overall. A few nits below:

> Signed-off-by: Luis Henriques <[email protected]>
> ---
> fs/ceph/file.c | 45 +++++++++++++++++++++-----
> include/linux/ceph/osd_client.h | 6 +++-
> net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> 3 files changed, 85 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> struct ceph_object_locator src_oloc, dst_oloc;
> struct ceph_object_id src_oid, dst_oid;
> + struct ceph_osd_request *req;
> loff_t endoff = 0, size;
> ssize_t ret = -EIO;
> u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> u32 src_objlen, dst_objlen, object_size;
> int src_got = 0, dst_got = 0, err, dirty;
> + unsigned int max_copies, copy_count, reqs_complete = 0;
> bool do_final_copy = false;
> + LIST_HEAD(osd_reqs);
>
> if (src_inode->i_sb != dst_inode->i_sb) {
> struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> goto out_caps;
> }
> object_size = src_ci->i_layout.object_size;
> +
> + /*
> + * Throttle the object copies: max_copies holds the number of allowed
> + * in-flight 'copy-from' requests before waiting for their completion
> + */
> + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;

A note about why you chose to multiply by a factor of 4 here would be
good. In another year or two, we won't remember.

> + copy_count = max_copies;
> while (len >= object_size) {
> ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> object_size, &src_objnum,
> @@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> ceph_oid_printf(&dst_oid, "%llx.%08llx",
> dst_ci->i_vino.ino, dst_objnum);
> /* Do an object remote copy */
> - err = ceph_osdc_copy_from(
> + req = ceph_osdc_copy_from(
> &src_fsc->client->osdc,
> src_ci->i_vino.snap, 0,
> &src_oid, &src_oloc,
> @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> - if (err) {
> - if (err == -EOPNOTSUPP) {
> - src_fsc->have_copy_from2 = false;
> - pr_notice("OSDs don't support 'copy-from2'; "
> - "disabling copy_file_range\n");
> - }
> + if (IS_ERR(req)) {
> + err = PTR_ERR(req);
> dout("ceph_osdc_copy_from returned %d\n", err);
> +
> + /* wait for all queued requests */
> + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> + ret += reqs_complete * object_size; /* Update copied bytes */
> if (!ret)
> ret = err;
> goto out_caps;
> }
> + list_add(&req->r_private_item, &osd_reqs);
> len -= object_size;
> src_off += object_size;
> dst_off += object_size;
> - ret += object_size;
> + /*
> + * Wait requests if we've reached the maximum requests allowed

"Wait for requests..."

> + * or if this was the last copy
> + */
> + if ((--copy_count == 0) || (len < object_size)) {
> + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> + ret += reqs_complete * object_size; /* Update copied bytes */
> + if (err) {
> + if (err == -EOPNOTSUPP) {
> + src_fsc->have_copy_from2 = false;
> + pr_notice("OSDs don't support 'copy-from2'; "
> + "disabling copy_file_range\n");
> + }
> + if (!ret)
> + ret = err;
> + goto out_caps;
> + }
> + copy_count = max_copies;
> + }
> }
>
> if (len)
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 5a62dbd3f4c2..0121767cd65e 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> struct ceph_osd_request *req);
> +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> + unsigned int *reqs_complete);
> +
> extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
>
> extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> struct timespec64 *mtime,
> struct page **pages, int nr_pages);
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> +struct ceph_osd_request *ceph_osdc_copy_from(
> + struct ceph_osd_client *osdc,
> u64 src_snapid, u64 src_version,
> struct ceph_object_id *src_oid,
> struct ceph_object_locator *src_oloc,
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index b68b376d8c2f..df9f342f860a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> }
> EXPORT_SYMBOL(ceph_osdc_wait_request);
>
> +/*
> + * wait for all requests to complete in list @osd_reqs, returning the number of
> + * successful completions in @reqs_complete
> + */

Maybe consider just having it return a positive reqs_complete value or a
negative error code, and drop the reqs_complete pointer argument? It'd
also be good to note what this function returns.

> +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> + unsigned int *reqs_complete)
> +{
> + struct ceph_osd_request *req;
> + int ret = 0, err;
> + unsigned int counter = 0;
> +
> + while (!list_empty(osd_reqs)) {
> + req = list_first_entry(osd_reqs,
> + struct ceph_osd_request,
> + r_private_item);
> + list_del_init(&req->r_private_item);
> + err = ceph_osdc_wait_request(req->r_osdc, req);

ceph_osdc_wait_request calls wait_request_timeout, which uses a killable
sleep. That's better than uninterruptible sleep, but maybe it'd be good
to use an interruptible sleep here instead? Having to send fatal signals
when things are stuck kind of sucks.

> + if (!err)
> + counter++;
> + else if (!ret)
> + ret = err;
> + ceph_osdc_put_request(req);
> + }
> + *reqs_complete = counter;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(ceph_osdc_wait_requests);
> +
> /*
> * sync - wait for all in-flight requests to flush. avoid starvation.
> */
> @@ -5346,23 +5375,24 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
> return 0;
> }
>
> -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> - u64 src_snapid, u64 src_version,
> - struct ceph_object_id *src_oid,
> - struct ceph_object_locator *src_oloc,
> - u32 src_fadvise_flags,
> - struct ceph_object_id *dst_oid,
> - struct ceph_object_locator *dst_oloc,
> - u32 dst_fadvise_flags,
> - u32 truncate_seq, u64 truncate_size,
> - u8 copy_from_flags)
> +struct ceph_osd_request *ceph_osdc_copy_from(
> + struct ceph_osd_client *osdc,
> + u64 src_snapid, u64 src_version,
> + struct ceph_object_id *src_oid,
> + struct ceph_object_locator *src_oloc,
> + u32 src_fadvise_flags,
> + struct ceph_object_id *dst_oid,
> + struct ceph_object_locator *dst_oloc,
> + u32 dst_fadvise_flags,
> + u32 truncate_seq, u64 truncate_size,
> + u8 copy_from_flags)
> {
> struct ceph_osd_request *req;
> int ret;
>
> req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL);
> if (!req)
> - return -ENOMEM;
> + return ERR_PTR(-ENOMEM);
>
> req->r_flags = CEPH_OSD_FLAG_WRITE;
>
> @@ -5381,11 +5411,11 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> goto out;
>
> ceph_osdc_start_request(osdc, req, false);
> - ret = ceph_osdc_wait_request(osdc, req);
> + return req;
>
> out:
> ceph_osdc_put_request(req);
> - return ret;
> + return ERR_PTR(ret);
> }
> EXPORT_SYMBOL(ceph_osdc_copy_from);
>

--
Jeff Layton <[email protected]>

2020-02-04 15:13:32

by Luis Henriques

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
...
> > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > - if (err) {
> > - if (err == -EOPNOTSUPP) {
> > - src_fsc->have_copy_from2 = false;
> > - pr_notice("OSDs don't support 'copy-from2'; "
> > - "disabling copy_file_range\n");
> > - }
> > + if (IS_ERR(req)) {
> > + err = PTR_ERR(req);
> > dout("ceph_osdc_copy_from returned %d\n", err);
> > +
> > + /* wait for all queued requests */
> > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > + ret += reqs_complete * object_size; /* Update copied bytes */
>
> Hi Luis,
>
> Looks like ret is still incremented unconditionally? What happens
> if there are three OSD requests on the list and the first fails but
> the second and the third succeed? As is, ceph_osdc_wait_requests()
> will return an error with reqs_complete set to 2...
>
> > if (!ret)
> > ret = err;
>
> ... and we will return 8M instead of an error.

Right, my assumption was that if a request fails, all subsequent requests
would also fail. This would allow ret to be updated with the number of
successful requests (x object size), even if the OSDs replies were being
delivered in a different order. But from your comment I see that my
assumption is incorrect.

In that case, when shall ret be updated with the number of bytes already
written? Only after a successful call to ceph_osdc_wait_requests()?
I.e. only after each throttling cycle, when we don't have any requests
pending completion? In this case, I can simply drop the extra
reqs_complete parameter to the ceph_osdc_wait_requests.

In your example the right thing to do would be to simply return an error,
I guess. But then we're assuming that we're loosing space in the storage,
as we may have created objects that won't be reachable anymore.

>
> > goto out_caps;
> > }
> > + list_add(&req->r_private_item, &osd_reqs);
> > len -= object_size;
> > src_off += object_size;
> > dst_off += object_size;
> > - ret += object_size;
> > + /*
> > + * Wait requests if we've reached the maximum requests allowed
> > + * or if this was the last copy
> > + */
> > + if ((--copy_count == 0) || (len < object_size)) {
> > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > + ret += reqs_complete * object_size; /* Update copied bytes */
>
> Same here.
>
> > + if (err) {
> > + if (err == -EOPNOTSUPP) {
> > + src_fsc->have_copy_from2 = false;
>
> Since EOPNOTSUPP is special in that it triggers the fallback, it
> should be returned even if something was copied. Think about a
> mixed cluster, where some OSDs support copy-from2 and some don't.
> If the range is split between such OSDs, copy_file_range() will
> always return short if the range happens to start on a new OSD.

IMO, if we managed to copy some objects, we still need to return the
number of bytes copied. Because, since this return value will be less
then the expected amount of bytes, the application will retry the
operation. And at that point, since we've set have_copy_from2 to 'false',
the default VFS implementation will be used.

>
> > + pr_notice("OSDs don't support 'copy-from2'; "
> > + "disabling copy_file_range\n");
>
> This line is over 80 characters but shouldn't be split because it
> is a grepable log message. Also, this message was slightly tweaked
> in ceph-client.git, so this patch doesn't apply.

Ok.

> > + }
> > + if (!ret)
> > + ret = err;
> > + goto out_caps;
>
> I'm confused about out_caps. Since a short copy is still copy which
> may change the size, update mtime and ctime times, etc why aren't we
> dirtying caps in these cases?

Hmm... yeah, this does look like a bug already present in the code.

> > + }
> > + copy_count = max_copies;
> > + }
> > }
> >
> > if (len)
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 5a62dbd3f4c2..0121767cd65e 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> > extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> > extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > struct ceph_osd_request *req);
> > +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > + unsigned int *reqs_complete);
> > +
> > extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
> >
> > extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> > @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> > struct timespec64 *mtime,
> > struct page **pages, int nr_pages);
> >
> > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > +struct ceph_osd_request *ceph_osdc_copy_from(
> > + struct ceph_osd_client *osdc,
> > u64 src_snapid, u64 src_version,
> > struct ceph_object_id *src_oid,
> > struct ceph_object_locator *src_oloc,
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index b68b376d8c2f..df9f342f860a 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > }
> > EXPORT_SYMBOL(ceph_osdc_wait_request);
> >
> > +/*
> > + * wait for all requests to complete in list @osd_reqs, returning the number of
> > + * successful completions in @reqs_complete
> > + */
> > +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > + unsigned int *reqs_complete)
> > +{
> > + struct ceph_osd_request *req;
> > + int ret = 0, err;
> > + unsigned int counter = 0;
> > +
> > + while (!list_empty(osd_reqs)) {
> > + req = list_first_entry(osd_reqs,
> > + struct ceph_osd_request,
> > + r_private_item);
> > + list_del_init(&req->r_private_item);
> > + err = ceph_osdc_wait_request(req->r_osdc, req);
> > + if (!err)
> > + counter++;
>
> I think you want to stop incrementing counter after encountering an
> error...
>
> Thanks,
>
> Ilya

Cheers,
--
Lu?s

2020-02-04 15:33:01

by Luis Henriques

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 04, 2020 at 08:30:03AM -0500, Jeff Layton wrote:
> On Mon, 2020-02-03 at 16:51 +0000, Luis Henriques wrote:
> > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > operations, waiting for each request to complete before sending the next
> > one. This patch modifies copy_file_range so that all the 'copy-from'
> > operations are sent in bulk and waits for its completion at the end. This
> > will allow significant speed-ups, specially when sending requests for
> > different target OSDs.
> >
>
> Looks good overall. A few nits below:
>
> > Signed-off-by: Luis Henriques <[email protected]>
> > ---
> > fs/ceph/file.c | 45 +++++++++++++++++++++-----
> > include/linux/ceph/osd_client.h | 6 +++-
> > net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> > 3 files changed, 85 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > struct ceph_object_locator src_oloc, dst_oloc;
> > struct ceph_object_id src_oid, dst_oid;
> > + struct ceph_osd_request *req;
> > loff_t endoff = 0, size;
> > ssize_t ret = -EIO;
> > u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > u32 src_objlen, dst_objlen, object_size;
> > int src_got = 0, dst_got = 0, err, dirty;
> > + unsigned int max_copies, copy_count, reqs_complete = 0;
> > bool do_final_copy = false;
> > + LIST_HEAD(osd_reqs);
> >
> > if (src_inode->i_sb != dst_inode->i_sb) {
> > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > goto out_caps;
> > }
> > object_size = src_ci->i_layout.object_size;
> > +
> > + /*
> > + * Throttle the object copies: max_copies holds the number of allowed
> > + * in-flight 'copy-from' requests before waiting for their completion
> > + */
> > + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
>
> A note about why you chose to multiply by a factor of 4 here would be
> good. In another year or two, we won't remember.

Sure, but to be honest I just picked an early suggestion from Ilya :-)
In practice it means that, by default, 64 will be the maximum requests
in-flight. I tested this value, and it looked OK although in the (very
humble) test cluster I've used a value of 16 (i.e. dropping the factor of
4) wasn't much worst.

> > + copy_count = max_copies;
> > while (len >= object_size) {
> > ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > object_size, &src_objnum,
> > @@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > dst_ci->i_vino.ino, dst_objnum);
> > /* Do an object remote copy */
> > - err = ceph_osdc_copy_from(
> > + req = ceph_osdc_copy_from(
> > &src_fsc->client->osdc,
> > src_ci->i_vino.snap, 0,
> > &src_oid, &src_oloc,
> > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > - if (err) {
> > - if (err == -EOPNOTSUPP) {
> > - src_fsc->have_copy_from2 = false;
> > - pr_notice("OSDs don't support 'copy-from2'; "
> > - "disabling copy_file_range\n");
> > - }
> > + if (IS_ERR(req)) {
> > + err = PTR_ERR(req);
> > dout("ceph_osdc_copy_from returned %d\n", err);
> > +
> > + /* wait for all queued requests */
> > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > + ret += reqs_complete * object_size; /* Update copied bytes */
> > if (!ret)
> > ret = err;
> > goto out_caps;
> > }
> > + list_add(&req->r_private_item, &osd_reqs);
> > len -= object_size;
> > src_off += object_size;
> > dst_off += object_size;
> > - ret += object_size;
> > + /*
> > + * Wait requests if we've reached the maximum requests allowed
>
> "Wait for requests..."
>
> > + * or if this was the last copy
> > + */
> > + if ((--copy_count == 0) || (len < object_size)) {
> > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > + ret += reqs_complete * object_size; /* Update copied bytes */
> > + if (err) {
> > + if (err == -EOPNOTSUPP) {
> > + src_fsc->have_copy_from2 = false;
> > + pr_notice("OSDs don't support 'copy-from2'; "
> > + "disabling copy_file_range\n");
> > + }
> > + if (!ret)
> > + ret = err;
> > + goto out_caps;
> > + }
> > + copy_count = max_copies;
> > + }
> > }
> >
> > if (len)
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 5a62dbd3f4c2..0121767cd65e 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> > extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> > extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > struct ceph_osd_request *req);
> > +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > + unsigned int *reqs_complete);
> > +
> > extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
> >
> > extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> > @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> > struct timespec64 *mtime,
> > struct page **pages, int nr_pages);
> >
> > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > +struct ceph_osd_request *ceph_osdc_copy_from(
> > + struct ceph_osd_client *osdc,
> > u64 src_snapid, u64 src_version,
> > struct ceph_object_id *src_oid,
> > struct ceph_object_locator *src_oloc,
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index b68b376d8c2f..df9f342f860a 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > }
> > EXPORT_SYMBOL(ceph_osdc_wait_request);
> >
> > +/*
> > + * wait for all requests to complete in list @osd_reqs, returning the number of
> > + * successful completions in @reqs_complete
> > + */
>
> Maybe consider just having it return a positive reqs_complete value or a
> negative error code, and drop the reqs_complete pointer argument? It'd
> also be good to note what this function returns.

In my (flawed) design I wanted to know that there was an error in a
request but also how many successful requests. But after the last review
from Ilya I'll probably need to revisit this anyway.

> > +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > + unsigned int *reqs_complete)
> > +{
> > + struct ceph_osd_request *req;
> > + int ret = 0, err;
> > + unsigned int counter = 0;
> > +
> > + while (!list_empty(osd_reqs)) {
> > + req = list_first_entry(osd_reqs,
> > + struct ceph_osd_request,
> > + r_private_item);
> > + list_del_init(&req->r_private_item);
> > + err = ceph_osdc_wait_request(req->r_osdc, req);
>
> ceph_osdc_wait_request calls wait_request_timeout, which uses a killable
> sleep. That's better than uninterruptible sleep, but maybe it'd be good
> to use an interruptible sleep here instead? Having to send fatal signals
> when things are stuck kind of sucks.

Good point. It looks like Zheng changed this to a killable sleep in
commit 0e76abf21e76 ("libceph: make ceph_osdc_wait_request()
uninterruptible"). I guess you're suggesting to add a new function
(wait_request_uninterruptible_timeout) that would be used only here,
right?

Cheers,
--
Lu?s

2020-02-04 18:07:41

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <[email protected]> wrote:
>
> On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> ...
> > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > - if (err) {
> > > - if (err == -EOPNOTSUPP) {
> > > - src_fsc->have_copy_from2 = false;
> > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > - "disabling copy_file_range\n");
> > > - }
> > > + if (IS_ERR(req)) {
> > > + err = PTR_ERR(req);
> > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > +
> > > + /* wait for all queued requests */
> > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > + ret += reqs_complete * object_size; /* Update copied bytes */
> >
> > Hi Luis,
> >
> > Looks like ret is still incremented unconditionally? What happens
> > if there are three OSD requests on the list and the first fails but
> > the second and the third succeed? As is, ceph_osdc_wait_requests()
> > will return an error with reqs_complete set to 2...
> >
> > > if (!ret)
> > > ret = err;
> >
> > ... and we will return 8M instead of an error.
>
> Right, my assumption was that if a request fails, all subsequent requests
> would also fail. This would allow ret to be updated with the number of
> successful requests (x object size), even if the OSDs replies were being
> delivered in a different order. But from your comment I see that my
> assumption is incorrect.
>
> In that case, when shall ret be updated with the number of bytes already
> written? Only after a successful call to ceph_osdc_wait_requests()?

I mentioned this in the previous email: you probably want to change
ceph_osdc_wait_requests() so that the counter isn't incremented after
an error is encountered.

> I.e. only after each throttling cycle, when we don't have any requests
> pending completion? In this case, I can simply drop the extra
> reqs_complete parameter to the ceph_osdc_wait_requests.
>
> In your example the right thing to do would be to simply return an error,
> I guess. But then we're assuming that we're loosing space in the storage,
> as we may have created objects that won't be reachable anymore.

Well, that is what I'm getting at -- this needs a lot more
consideration. How errors are dealt with, how file metadata is
updated, when do we fall back to plain copy, etc. Generating stray
objects is bad but way better than reporting that e.g. 0..12M were
copied when only 0..4M and 8M..12M were actually copied, leaving
the user one step away from data loss. One option is to revert to
issuing copy-from requests serially when an error is encountered.
Another option is to fall back to plain copy on any error. Or perhaps
we just don't bother with the complexity of parallel copy-from requests
at all...

Of course, no matter what we do for parallel copy-from requests, the
existing short copy bug needs to be fixed separately.

>
> >
> > > goto out_caps;
> > > }
> > > + list_add(&req->r_private_item, &osd_reqs);
> > > len -= object_size;
> > > src_off += object_size;
> > > dst_off += object_size;
> > > - ret += object_size;
> > > + /*
> > > + * Wait requests if we've reached the maximum requests allowed
> > > + * or if this was the last copy
> > > + */
> > > + if ((--copy_count == 0) || (len < object_size)) {
> > > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > + ret += reqs_complete * object_size; /* Update copied bytes */
> >
> > Same here.
> >
> > > + if (err) {
> > > + if (err == -EOPNOTSUPP) {
> > > + src_fsc->have_copy_from2 = false;
> >
> > Since EOPNOTSUPP is special in that it triggers the fallback, it
> > should be returned even if something was copied. Think about a
> > mixed cluster, where some OSDs support copy-from2 and some don't.
> > If the range is split between such OSDs, copy_file_range() will
> > always return short if the range happens to start on a new OSD.
>
> IMO, if we managed to copy some objects, we still need to return the
> number of bytes copied. Because, since this return value will be less
> then the expected amount of bytes, the application will retry the
> operation. And at that point, since we've set have_copy_from2 to 'false',
> the default VFS implementation will be used.

Ah, yeah, given have_copy_from2 guard, this particular corner case is
fine.

Thanks,

Ilya

2020-02-04 19:44:29

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, 2020-02-04 at 15:30 +0000, Luis Henriques wrote:
> On Tue, Feb 04, 2020 at 08:30:03AM -0500, Jeff Layton wrote:
> > On Mon, 2020-02-03 at 16:51 +0000, Luis Henriques wrote:
> > > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > > operations, waiting for each request to complete before sending the next
> > > one. This patch modifies copy_file_range so that all the 'copy-from'
> > > operations are sent in bulk and waits for its completion at the end. This
> > > will allow significant speed-ups, specially when sending requests for
> > > different target OSDs.
> > >
> >
> > Looks good overall. A few nits below:
> >
> > > Signed-off-by: Luis Henriques <[email protected]>
> > > ---
> > > fs/ceph/file.c | 45 +++++++++++++++++++++-----
> > > include/linux/ceph/osd_client.h | 6 +++-
> > > net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> > > 3 files changed, 85 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > > struct ceph_object_locator src_oloc, dst_oloc;
> > > struct ceph_object_id src_oid, dst_oid;
> > > + struct ceph_osd_request *req;
> > > loff_t endoff = 0, size;
> > > ssize_t ret = -EIO;
> > > u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > > u32 src_objlen, dst_objlen, object_size;
> > > int src_got = 0, dst_got = 0, err, dirty;
> > > + unsigned int max_copies, copy_count, reqs_complete = 0;
> > > bool do_final_copy = false;
> > > + LIST_HEAD(osd_reqs);
> > >
> > > if (src_inode->i_sb != dst_inode->i_sb) {
> > > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > goto out_caps;
> > > }
> > > object_size = src_ci->i_layout.object_size;
> > > +
> > > + /*
> > > + * Throttle the object copies: max_copies holds the number of allowed
> > > + * in-flight 'copy-from' requests before waiting for their completion
> > > + */
> > > + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
> >
> > A note about why you chose to multiply by a factor of 4 here would be
> > good. In another year or two, we won't remember.
>
> Sure, but to be honest I just picked an early suggestion from Ilya :-)
> In practice it means that, by default, 64 will be the maximum requests
> in-flight. I tested this value, and it looked OK although in the (very
> humble) test cluster I've used a value of 16 (i.e. dropping the factor of
> 4) wasn't much worst.
>

What happens if you ramp it up to be even more greedy? Does that speed
things up? It might be worth doing some experimentation with a tunable
too?

In any case though, we need to consider what the ideal default would be.
I'm now wondering whether the wsize is the right thing to base this on.
If you have a 1000 OSD cluster, then even 64 actually sounds a bit weak.

I wonder whether it should it be calculated on the fly from some
function of the number OSDs or PGs in the data pool? Maybe even
something like:

(number of PGs) / 2

...assuming the copies go between 2 PGs. With a perfect distribution, you'd
be able to keep all your OSDs busy that way and do the copy really quickly.

> > > + copy_count = max_copies;
> > > while (len >= object_size) {
> > > ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > > object_size, &src_objnum,
> > > @@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > > dst_ci->i_vino.ino, dst_objnum);
> > > /* Do an object remote copy */
> > > - err = ceph_osdc_copy_from(
> > > + req = ceph_osdc_copy_from(
> > > &src_fsc->client->osdc,
> > > src_ci->i_vino.snap, 0,
> > > &src_oid, &src_oloc,
> > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > - if (err) {
> > > - if (err == -EOPNOTSUPP) {
> > > - src_fsc->have_copy_from2 = false;
> > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > - "disabling copy_file_range\n");
> > > - }
> > > + if (IS_ERR(req)) {
> > > + err = PTR_ERR(req);
> > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > +
> > > + /* wait for all queued requests */
> > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > if (!ret)
> > > ret = err;
> > > goto out_caps;
> > > }
> > > + list_add(&req->r_private_item, &osd_reqs);
> > > len -= object_size;
> > > src_off += object_size;
> > > dst_off += object_size;
> > > - ret += object_size;
> > > + /*
> > > + * Wait requests if we've reached the maximum requests allowed
> >
> > "Wait for requests..."
> >
> > > + * or if this was the last copy
> > > + */
> > > + if ((--copy_count == 0) || (len < object_size)) {
> > > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > + if (err) {
> > > + if (err == -EOPNOTSUPP) {
> > > + src_fsc->have_copy_from2 = false;
> > > + pr_notice("OSDs don't support 'copy-from2'; "
> > > + "disabling copy_file_range\n");
> > > + }
> > > + if (!ret)
> > > + ret = err;
> > > + goto out_caps;
> > > + }
> > > + copy_count = max_copies;
> > > + }
> > > }
> > >
> > > if (len)
> > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > > index 5a62dbd3f4c2..0121767cd65e 100644
> > > --- a/include/linux/ceph/osd_client.h
> > > +++ b/include/linux/ceph/osd_client.h
> > > @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> > > extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> > > extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > > struct ceph_osd_request *req);
> > > +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > > + unsigned int *reqs_complete);
> > > +
> > > extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
> > >
> > > extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> > > @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> > > struct timespec64 *mtime,
> > > struct page **pages, int nr_pages);
> > >
> > > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > > +struct ceph_osd_request *ceph_osdc_copy_from(
> > > + struct ceph_osd_client *osdc,
> > > u64 src_snapid, u64 src_version,
> > > struct ceph_object_id *src_oid,
> > > struct ceph_object_locator *src_oloc,
> > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > > index b68b376d8c2f..df9f342f860a 100644
> > > --- a/net/ceph/osd_client.c
> > > +++ b/net/ceph/osd_client.c
> > > @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > > }
> > > EXPORT_SYMBOL(ceph_osdc_wait_request);
> > >
> > > +/*
> > > + * wait for all requests to complete in list @osd_reqs, returning the number of
> > > + * successful completions in @reqs_complete
> > > + */
> >
> > Maybe consider just having it return a positive reqs_complete value or a
> > negative error code, and drop the reqs_complete pointer argument? It'd
> > also be good to note what this function returns.
>
> In my (flawed) design I wanted to know that there was an error in a
> request but also how many successful requests. But after the last review
> from Ilya I'll probably need to revisit this anyway.
>

Sounds good.

> > > +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > > + unsigned int *reqs_complete)
> > > +{
> > > + struct ceph_osd_request *req;
> > > + int ret = 0, err;
> > > + unsigned int counter = 0;
> > > +
> > > + while (!list_empty(osd_reqs)) {
> > > + req = list_first_entry(osd_reqs,
> > > + struct ceph_osd_request,
> > > + r_private_item);
> > > + list_del_init(&req->r_private_item);
> > > + err = ceph_osdc_wait_request(req->r_osdc, req);
> >
> > ceph_osdc_wait_request calls wait_request_timeout, which uses a killable
> > sleep. That's better than uninterruptible sleep, but maybe it'd be good
> > to use an interruptible sleep here instead? Having to send fatal signals
> > when things are stuck kind of sucks.
>
> Good point. It looks like Zheng changed this to a killable sleep in
> commit 0e76abf21e76 ("libceph: make ceph_osdc_wait_request()
> uninterruptible"). I guess you're suggesting to add a new function
> (wait_request_uninterruptible_timeout) that would be used only here,
> right?
>

Yes, basically. We're not dealing with writeback in this function, so
I'm not so worried about things getting interrupted. If the user wants
to hit ^c here I don't see any reason to wait.

--
Jeff Layton <[email protected]>

2020-02-04 20:33:23

by Gregory Farnum

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 4, 2020 at 11:42 AM Jeff Layton <[email protected]> wrote:
>
> On Tue, 2020-02-04 at 15:30 +0000, Luis Henriques wrote:
> > On Tue, Feb 04, 2020 at 08:30:03AM -0500, Jeff Layton wrote:
> > > On Mon, 2020-02-03 at 16:51 +0000, Luis Henriques wrote:
> > > > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > > > operations, waiting for each request to complete before sending the next
> > > > one. This patch modifies copy_file_range so that all the 'copy-from'
> > > > operations are sent in bulk and waits for its completion at the end. This
> > > > will allow significant speed-ups, specially when sending requests for
> > > > different target OSDs.
> > > >
> > >
> > > Looks good overall. A few nits below:
> > >
> > > > Signed-off-by: Luis Henriques <[email protected]>
> > > > ---
> > > > fs/ceph/file.c | 45 +++++++++++++++++++++-----
> > > > include/linux/ceph/osd_client.h | 6 +++-
> > > > net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> > > > 3 files changed, 85 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > > > struct ceph_object_locator src_oloc, dst_oloc;
> > > > struct ceph_object_id src_oid, dst_oid;
> > > > + struct ceph_osd_request *req;
> > > > loff_t endoff = 0, size;
> > > > ssize_t ret = -EIO;
> > > > u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > > > u32 src_objlen, dst_objlen, object_size;
> > > > int src_got = 0, dst_got = 0, err, dirty;
> > > > + unsigned int max_copies, copy_count, reqs_complete = 0;
> > > > bool do_final_copy = false;
> > > > + LIST_HEAD(osd_reqs);
> > > >
> > > > if (src_inode->i_sb != dst_inode->i_sb) {
> > > > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > > @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > goto out_caps;
> > > > }
> > > > object_size = src_ci->i_layout.object_size;
> > > > +
> > > > + /*
> > > > + * Throttle the object copies: max_copies holds the number of allowed
> > > > + * in-flight 'copy-from' requests before waiting for their completion
> > > > + */
> > > > + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
> > >
> > > A note about why you chose to multiply by a factor of 4 here would be
> > > good. In another year or two, we won't remember.
> >
> > Sure, but to be honest I just picked an early suggestion from Ilya :-)
> > In practice it means that, by default, 64 will be the maximum requests
> > in-flight. I tested this value, and it looked OK although in the (very
> > humble) test cluster I've used a value of 16 (i.e. dropping the factor of
> > 4) wasn't much worst.
> >
>
> What happens if you ramp it up to be even more greedy? Does that speed
> things up? It might be worth doing some experimentation with a tunable
> too?
>
> In any case though, we need to consider what the ideal default would be.
> I'm now wondering whether the wsize is the right thing to base this on.
> If you have a 1000 OSD cluster, then even 64 actually sounds a bit weak.
>
> I wonder whether it should it be calculated on the fly from some
> function of the number OSDs or PGs in the data pool? Maybe even
> something like:
>
> (number of PGs) / 2
>
> ...assuming the copies go between 2 PGs. With a perfect distribution, you'd
> be able to keep all your OSDs busy that way and do the copy really quickly.

No, as I said before, let's not try and set this up so that a single
client can automatically saturate the cluster any more than we already
do with the normal readahead stuff. We're bad enough about throttling
and input limits without aggressively trying to hit them. :)
-Greg

>
> > > > + copy_count = max_copies;
> > > > while (len >= object_size) {
> > > > ceph_calc_file_object_mapping(&src_ci->i_layout, src_off,
> > > > object_size, &src_objnum,
> > > > @@ -2097,7 +2107,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > ceph_oid_printf(&dst_oid, "%llx.%08llx",
> > > > dst_ci->i_vino.ino, dst_objnum);
> > > > /* Do an object remote copy */
> > > > - err = ceph_osdc_copy_from(
> > > > + req = ceph_osdc_copy_from(
> > > > &src_fsc->client->osdc,
> > > > src_ci->i_vino.snap, 0,
> > > > &src_oid, &src_oloc,
> > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > - if (err) {
> > > > - if (err == -EOPNOTSUPP) {
> > > > - src_fsc->have_copy_from2 = false;
> > > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > > - "disabling copy_file_range\n");
> > > > - }
> > > > + if (IS_ERR(req)) {
> > > > + err = PTR_ERR(req);
> > > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > > +
> > > > + /* wait for all queued requests */
> > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > > if (!ret)
> > > > ret = err;
> > > > goto out_caps;
> > > > }
> > > > + list_add(&req->r_private_item, &osd_reqs);
> > > > len -= object_size;
> > > > src_off += object_size;
> > > > dst_off += object_size;
> > > > - ret += object_size;
> > > > + /*
> > > > + * Wait requests if we've reached the maximum requests allowed
> > >
> > > "Wait for requests..."
> > >
> > > > + * or if this was the last copy
> > > > + */
> > > > + if ((--copy_count == 0) || (len < object_size)) {
> > > > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > > + if (err) {
> > > > + if (err == -EOPNOTSUPP) {
> > > > + src_fsc->have_copy_from2 = false;
> > > > + pr_notice("OSDs don't support 'copy-from2'; "
> > > > + "disabling copy_file_range\n");
> > > > + }
> > > > + if (!ret)
> > > > + ret = err;
> > > > + goto out_caps;
> > > > + }
> > > > + copy_count = max_copies;
> > > > + }
> > > > }
> > > >
> > > > if (len)
> > > > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > > > index 5a62dbd3f4c2..0121767cd65e 100644
> > > > --- a/include/linux/ceph/osd_client.h
> > > > +++ b/include/linux/ceph/osd_client.h
> > > > @@ -496,6 +496,9 @@ extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
> > > > extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> > > > extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > > > struct ceph_osd_request *req);
> > > > +extern int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > > > + unsigned int *reqs_complete);
> > > > +
> > > > extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
> > > >
> > > > extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> > > > @@ -526,7 +529,8 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
> > > > struct timespec64 *mtime,
> > > > struct page **pages, int nr_pages);
> > > >
> > > > -int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
> > > > +struct ceph_osd_request *ceph_osdc_copy_from(
> > > > + struct ceph_osd_client *osdc,
> > > > u64 src_snapid, u64 src_version,
> > > > struct ceph_object_id *src_oid,
> > > > struct ceph_object_locator *src_oloc,
> > > > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > > > index b68b376d8c2f..df9f342f860a 100644
> > > > --- a/net/ceph/osd_client.c
> > > > +++ b/net/ceph/osd_client.c
> > > > @@ -4531,6 +4531,35 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> > > > }
> > > > EXPORT_SYMBOL(ceph_osdc_wait_request);
> > > >
> > > > +/*
> > > > + * wait for all requests to complete in list @osd_reqs, returning the number of
> > > > + * successful completions in @reqs_complete
> > > > + */
> > >
> > > Maybe consider just having it return a positive reqs_complete value or a
> > > negative error code, and drop the reqs_complete pointer argument? It'd
> > > also be good to note what this function returns.
> >
> > In my (flawed) design I wanted to know that there was an error in a
> > request but also how many successful requests. But after the last review
> > from Ilya I'll probably need to revisit this anyway.
> >
>
> Sounds good.
>
> > > > +int ceph_osdc_wait_requests(struct list_head *osd_reqs,
> > > > + unsigned int *reqs_complete)
> > > > +{
> > > > + struct ceph_osd_request *req;
> > > > + int ret = 0, err;
> > > > + unsigned int counter = 0;
> > > > +
> > > > + while (!list_empty(osd_reqs)) {
> > > > + req = list_first_entry(osd_reqs,
> > > > + struct ceph_osd_request,
> > > > + r_private_item);
> > > > + list_del_init(&req->r_private_item);
> > > > + err = ceph_osdc_wait_request(req->r_osdc, req);
> > >
> > > ceph_osdc_wait_request calls wait_request_timeout, which uses a killable
> > > sleep. That's better than uninterruptible sleep, but maybe it'd be good
> > > to use an interruptible sleep here instead? Having to send fatal signals
> > > when things are stuck kind of sucks.
> >
> > Good point. It looks like Zheng changed this to a killable sleep in
> > commit 0e76abf21e76 ("libceph: make ceph_osdc_wait_request()
> > uninterruptible"). I guess you're suggesting to add a new function
> > (wait_request_uninterruptible_timeout) that would be used only here,
> > right?
> >
>
> Yes, basically. We're not dealing with writeback in this function, so
> I'm not so worried about things getting interrupted. If the user wants
> to hit ^c here I don't see any reason to wait.
>
> --
> Jeff Layton <[email protected]>
>

2020-02-04 21:40:40

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 4, 2020 at 8:42 PM Jeff Layton <[email protected]> wrote:
>
> On Tue, 2020-02-04 at 15:30 +0000, Luis Henriques wrote:
> > On Tue, Feb 04, 2020 at 08:30:03AM -0500, Jeff Layton wrote:
> > > On Mon, 2020-02-03 at 16:51 +0000, Luis Henriques wrote:
> > > > Right now the copy_file_range syscall serializes all the OSDs 'copy-from'
> > > > operations, waiting for each request to complete before sending the next
> > > > one. This patch modifies copy_file_range so that all the 'copy-from'
> > > > operations are sent in bulk and waits for its completion at the end. This
> > > > will allow significant speed-ups, specially when sending requests for
> > > > different target OSDs.
> > > >
> > >
> > > Looks good overall. A few nits below:
> > >
> > > > Signed-off-by: Luis Henriques <[email protected]>
> > > > ---
> > > > fs/ceph/file.c | 45 +++++++++++++++++++++-----
> > > > include/linux/ceph/osd_client.h | 6 +++-
> > > > net/ceph/osd_client.c | 56 +++++++++++++++++++++++++--------
> > > > 3 files changed, 85 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 1e6cdf2dfe90..b9d8ffafb8c5 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1943,12 +1943,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > > > struct ceph_object_locator src_oloc, dst_oloc;
> > > > struct ceph_object_id src_oid, dst_oid;
> > > > + struct ceph_osd_request *req;
> > > > loff_t endoff = 0, size;
> > > > ssize_t ret = -EIO;
> > > > u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
> > > > u32 src_objlen, dst_objlen, object_size;
> > > > int src_got = 0, dst_got = 0, err, dirty;
> > > > + unsigned int max_copies, copy_count, reqs_complete = 0;
> > > > bool do_final_copy = false;
> > > > + LIST_HEAD(osd_reqs);
> > > >
> > > > if (src_inode->i_sb != dst_inode->i_sb) {
> > > > struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > > @@ -2083,6 +2086,13 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > goto out_caps;
> > > > }
> > > > object_size = src_ci->i_layout.object_size;
> > > > +
> > > > + /*
> > > > + * Throttle the object copies: max_copies holds the number of allowed
> > > > + * in-flight 'copy-from' requests before waiting for their completion
> > > > + */
> > > > + max_copies = (src_fsc->mount_options->wsize / object_size) * 4;
> > >
> > > A note about why you chose to multiply by a factor of 4 here would be
> > > good. In another year or two, we won't remember.
> >
> > Sure, but to be honest I just picked an early suggestion from Ilya :-)
> > In practice it means that, by default, 64 will be the maximum requests
> > in-flight. I tested this value, and it looked OK although in the (very
> > humble) test cluster I've used a value of 16 (i.e. dropping the factor of
> > 4) wasn't much worst.
> >
>
> What happens if you ramp it up to be even more greedy? Does that speed
> things up? It might be worth doing some experimentation with a tunable
> too?
>
> In any case though, we need to consider what the ideal default would be.
> I'm now wondering whether the wsize is the right thing to base this on.
> If you have a 1000 OSD cluster, then even 64 actually sounds a bit weak.
>
> I wonder whether it should it be calculated on the fly from some
> function of the number OSDs or PGs in the data pool? Maybe even
> something like:
>
> (number of PGs) / 2

That can easily generate thousands of in-flight requests...

The OSD cluster may serve more than one filesystem, each of which may
serve more than one user. Zooming out, it's pretty common to have the
same cluster provide both block and file storage. Allowing a single
process to overrun the entire cluster is a bad idea, especially when
it is something as routine as cp.

I suggested that factor be between 1 and 4. My suggestion was based
on BLKDEV_MAX_RQ (128): anything above that is just unreasonable for
a single syscall and it needs to be cut by at least half because there
is no actual queue to limit the overall number of in-flight requests
for the filesystem.

Thanks,

Ilya

2020-02-05 11:01:18

by Luis Henriques

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <[email protected]> wrote:
> >
> > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > ...
> > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > - if (err) {
> > > > - if (err == -EOPNOTSUPP) {
> > > > - src_fsc->have_copy_from2 = false;
> > > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > > - "disabling copy_file_range\n");
> > > > - }
> > > > + if (IS_ERR(req)) {
> > > > + err = PTR_ERR(req);
> > > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > > +
> > > > + /* wait for all queued requests */
> > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > >
> > > Hi Luis,
> > >
> > > Looks like ret is still incremented unconditionally? What happens
> > > if there are three OSD requests on the list and the first fails but
> > > the second and the third succeed? As is, ceph_osdc_wait_requests()
> > > will return an error with reqs_complete set to 2...
> > >
> > > > if (!ret)
> > > > ret = err;
> > >
> > > ... and we will return 8M instead of an error.
> >
> > Right, my assumption was that if a request fails, all subsequent requests
> > would also fail. This would allow ret to be updated with the number of
> > successful requests (x object size), even if the OSDs replies were being
> > delivered in a different order. But from your comment I see that my
> > assumption is incorrect.
> >
> > In that case, when shall ret be updated with the number of bytes already
> > written? Only after a successful call to ceph_osdc_wait_requests()?
>
> I mentioned this in the previous email: you probably want to change
> ceph_osdc_wait_requests() so that the counter isn't incremented after
> an error is encountered.

Sure, I've seen that comment. But it doesn't help either because it's not
guaranteed that we'll receive the replies from the OSDs in the same order
we've sent them. Stopping the counter when we get an error doesn't
provide us any reliable information (which means I can simply drop that
counter).

> > I.e. only after each throttling cycle, when we don't have any requests
> > pending completion? In this case, I can simply drop the extra
> > reqs_complete parameter to the ceph_osdc_wait_requests.
> >
> > In your example the right thing to do would be to simply return an error,
> > I guess. But then we're assuming that we're loosing space in the storage,
> > as we may have created objects that won't be reachable anymore.
>
> Well, that is what I'm getting at -- this needs a lot more
> consideration. How errors are dealt with, how file metadata is
> updated, when do we fall back to plain copy, etc. Generating stray
> objects is bad but way better than reporting that e.g. 0..12M were
> copied when only 0..4M and 8M..12M were actually copied, leaving
> the user one step away from data loss. One option is to revert to
> issuing copy-from requests serially when an error is encountered.
> Another option is to fall back to plain copy on any error. Or perhaps
> we just don't bother with the complexity of parallel copy-from requests
> at all...

To be honest, I'm starting to lean towards this option. Reverting to
serializing requests or to plain copy on error will not necessarily
prevent the stray objects:

- send a bunch of copy requests
- wait for them to complete
* 1 failed, the other 63 succeeded
- revert to serialized copies, repeating the previous 64 requests
* after a few copies, we get another failure (maybe on the same OSDs)
and abort, leaving behind some stray objects from the previous bulk
request

I guess the only way around this would be some sort of atomic operation
that would allow us to copy a bunch of objects in a single operation
(copy-from3!)

I thought a bit a about this while watching Jeff and Patrick's talk at
FOSDEM (great talk, by the way!) but I don't think async directory
operations have this problem because the requests there are
metadata-related and a failure in a request is somewhat independent from
other requests.

> Of course, no matter what we do for parallel copy-from requests, the
> existing short copy bug needs to be fixed separately.

Yep, [email protected] should fix that (Cc'ed
stable@ as well).

Cheers,
--
Lu?s

>
> >
> > >
> > > > goto out_caps;
> > > > }
> > > > + list_add(&req->r_private_item, &osd_reqs);
> > > > len -= object_size;
> > > > src_off += object_size;
> > > > dst_off += object_size;
> > > > - ret += object_size;
> > > > + /*
> > > > + * Wait requests if we've reached the maximum requests allowed
> > > > + * or if this was the last copy
> > > > + */
> > > > + if ((--copy_count == 0) || (len < object_size)) {
> > > > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > >
> > > Same here.
> > >
> > > > + if (err) {
> > > > + if (err == -EOPNOTSUPP) {
> > > > + src_fsc->have_copy_from2 = false;
> > >
> > > Since EOPNOTSUPP is special in that it triggers the fallback, it
> > > should be returned even if something was copied. Think about a
> > > mixed cluster, where some OSDs support copy-from2 and some don't.
> > > If the range is split between such OSDs, copy_file_range() will
> > > always return short if the range happens to start on a new OSD.
> >
> > IMO, if we managed to copy some objects, we still need to return the
> > number of bytes copied. Because, since this return value will be less
> > then the expected amount of bytes, the application will retry the
> > operation. And at that point, since we've set have_copy_from2 to 'false',
> > the default VFS implementation will be used.
>
> Ah, yeah, given have_copy_from2 guard, this particular corner case is
> fine.
>
> Thanks,
>
> Ilya

2020-02-05 12:02:46

by Ilya Dryomov

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Wed, Feb 5, 2020 at 12:00 PM Luis Henriques <[email protected]> wrote:
>
> On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> > On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <[email protected]> wrote:
> > >
> > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > > ...
> > > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > > - if (err) {
> > > > > - if (err == -EOPNOTSUPP) {
> > > > > - src_fsc->have_copy_from2 = false;
> > > > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > > > - "disabling copy_file_range\n");
> > > > > - }
> > > > > + if (IS_ERR(req)) {
> > > > > + err = PTR_ERR(req);
> > > > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > > > +
> > > > > + /* wait for all queued requests */
> > > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > >
> > > > Hi Luis,
> > > >
> > > > Looks like ret is still incremented unconditionally? What happens
> > > > if there are three OSD requests on the list and the first fails but
> > > > the second and the third succeed? As is, ceph_osdc_wait_requests()
> > > > will return an error with reqs_complete set to 2...
> > > >
> > > > > if (!ret)
> > > > > ret = err;
> > > >
> > > > ... and we will return 8M instead of an error.
> > >
> > > Right, my assumption was that if a request fails, all subsequent requests
> > > would also fail. This would allow ret to be updated with the number of
> > > successful requests (x object size), even if the OSDs replies were being
> > > delivered in a different order. But from your comment I see that my
> > > assumption is incorrect.
> > >
> > > In that case, when shall ret be updated with the number of bytes already
> > > written? Only after a successful call to ceph_osdc_wait_requests()?
> >
> > I mentioned this in the previous email: you probably want to change
> > ceph_osdc_wait_requests() so that the counter isn't incremented after
> > an error is encountered.
>
> Sure, I've seen that comment. But it doesn't help either because it's not
> guaranteed that we'll receive the replies from the OSDs in the same order
> we've sent them. Stopping the counter when we get an error doesn't
> provide us any reliable information (which means I can simply drop that
> counter).

The list is FIFO so even though replies may indeed arrive out of
order, ceph_osdc_wait_requests() will process them in order. If you
stop counting as soon as an error is encountered, you know for sure
that requests 1 through $COUNTER were successful and can safely
multiply it by object size.

>
> > > I.e. only after each throttling cycle, when we don't have any requests
> > > pending completion? In this case, I can simply drop the extra
> > > reqs_complete parameter to the ceph_osdc_wait_requests.
> > >
> > > In your example the right thing to do would be to simply return an error,
> > > I guess. But then we're assuming that we're loosing space in the storage,
> > > as we may have created objects that won't be reachable anymore.
> >
> > Well, that is what I'm getting at -- this needs a lot more
> > consideration. How errors are dealt with, how file metadata is
> > updated, when do we fall back to plain copy, etc. Generating stray
> > objects is bad but way better than reporting that e.g. 0..12M were
> > copied when only 0..4M and 8M..12M were actually copied, leaving
> > the user one step away from data loss. One option is to revert to
> > issuing copy-from requests serially when an error is encountered.
> > Another option is to fall back to plain copy on any error. Or perhaps
> > we just don't bother with the complexity of parallel copy-from requests
> > at all...
>
> To be honest, I'm starting to lean towards this option. Reverting to
> serializing requests or to plain copy on error will not necessarily
> prevent the stray objects:
>
> - send a bunch of copy requests
> - wait for them to complete
> * 1 failed, the other 63 succeeded
> - revert to serialized copies, repeating the previous 64 requests
> * after a few copies, we get another failure (maybe on the same OSDs)
> and abort, leaving behind some stray objects from the previous bulk
> request

Yeah, doing it serially makes the accounting a lot easier. If you
issue any OSD requests before updating the size, stray objects are
bound to happen -- that's why "how file metadata is updated" is one
of the important considerations.

Thanks,

Ilya

2020-02-05 13:14:15

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Wed, 2020-02-05 at 13:01 +0100, Ilya Dryomov wrote:
> On Wed, Feb 5, 2020 at 12:00 PM Luis Henriques <[email protected]> wrote:
> > On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> > > On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <[email protected]> wrote:
> > > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > > > ...
> > > > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > > > - if (err) {
> > > > > > - if (err == -EOPNOTSUPP) {
> > > > > > - src_fsc->have_copy_from2 = false;
> > > > > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > > > > - "disabling copy_file_range\n");
> > > > > > - }
> > > > > > + if (IS_ERR(req)) {
> > > > > > + err = PTR_ERR(req);
> > > > > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > > > > +
> > > > > > + /* wait for all queued requests */
> > > > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > > >
> > > > > Hi Luis,
> > > > >
> > > > > Looks like ret is still incremented unconditionally? What happens
> > > > > if there are three OSD requests on the list and the first fails but
> > > > > the second and the third succeed? As is, ceph_osdc_wait_requests()
> > > > > will return an error with reqs_complete set to 2...
> > > > >
> > > > > > if (!ret)
> > > > > > ret = err;
> > > > >
> > > > > ... and we will return 8M instead of an error.
> > > >
> > > > Right, my assumption was that if a request fails, all subsequent requests
> > > > would also fail. This would allow ret to be updated with the number of
> > > > successful requests (x object size), even if the OSDs replies were being
> > > > delivered in a different order. But from your comment I see that my
> > > > assumption is incorrect.
> > > >
> > > > In that case, when shall ret be updated with the number of bytes already
> > > > written? Only after a successful call to ceph_osdc_wait_requests()?
> > >
> > > I mentioned this in the previous email: you probably want to change
> > > ceph_osdc_wait_requests() so that the counter isn't incremented after
> > > an error is encountered.
> >
> > Sure, I've seen that comment. But it doesn't help either because it's not
> > guaranteed that we'll receive the replies from the OSDs in the same order
> > we've sent them. Stopping the counter when we get an error doesn't
> > provide us any reliable information (which means I can simply drop that
> > counter).
>
> The list is FIFO so even though replies may indeed arrive out of
> order, ceph_osdc_wait_requests() will process them in order. If you
> stop counting as soon as an error is encountered, you know for sure
> that requests 1 through $COUNTER were successful and can safely
> multiply it by object size.
>
> > > > I.e. only after each throttling cycle, when we don't have any requests
> > > > pending completion? In this case, I can simply drop the extra
> > > > reqs_complete parameter to the ceph_osdc_wait_requests.
> > > >
> > > > In your example the right thing to do would be to simply return an error,
> > > > I guess. But then we're assuming that we're loosing space in the storage,
> > > > as we may have created objects that won't be reachable anymore.
> > >
> > > Well, that is what I'm getting at -- this needs a lot more
> > > consideration. How errors are dealt with, how file metadata is
> > > updated, when do we fall back to plain copy, etc. Generating stray
> > > objects is bad but way better than reporting that e.g. 0..12M were
> > > copied when only 0..4M and 8M..12M were actually copied, leaving
> > > the user one step away from data loss. One option is to revert to
> > > issuing copy-from requests serially when an error is encountered.
> > > Another option is to fall back to plain copy on any error. Or perhaps
> > > we just don't bother with the complexity of parallel copy-from requests
> > > at all...
> >
> > To be honest, I'm starting to lean towards this option. Reverting to
> > serializing requests or to plain copy on error will not necessarily
> > prevent the stray objects:
> >
> > - send a bunch of copy requests
> > - wait for them to complete
> > * 1 failed, the other 63 succeeded
> > - revert to serialized copies, repeating the previous 64 requests
> > * after a few copies, we get another failure (maybe on the same OSDs)
> > and abort, leaving behind some stray objects from the previous bulk
> > request
>
> Yeah, doing it serially makes the accounting a lot easier. If you
> issue any OSD requests before updating the size, stray objects are
> bound to happen -- that's why "how file metadata is updated" is one
> of the important considerations.

I don't think this is fundamentally different from the direct write
codepath. It's just that the source of the write is another OSD rather
than being sent in the request.

If you look at ceph_direct_read_write(), then it does this:

/*
* To simplify error handling, allow AIO when IO within i_size
* or IO can be satisfied by single OSD request.
*/

So, that's another possibility. Do the requests synchronously when they
will result in a change to i_size. Otherwise, you can do them in
parallel.

In practice, we'd have to recommend that people truncate the destination
file up to the final length before doing copy_file_range, but that
doesn't sound too onerous.

--
Jeff Layton <[email protected]>

2020-02-05 19:42:41

by Luis Henriques

[permalink] [raw]

Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range

On Wed, Feb 05, 2020 at 08:12:22AM -0500, Jeff Layton wrote:
> On Wed, 2020-02-05 at 13:01 +0100, Ilya Dryomov wrote:
> > On Wed, Feb 5, 2020 at 12:00 PM Luis Henriques <[email protected]> wrote:
> > > On Tue, Feb 04, 2020 at 07:06:36PM +0100, Ilya Dryomov wrote:
> > > > On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques <[email protected]> wrote:
> > > > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote:
> > > > > ...
> > > > > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
> > > > > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
> > > > > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
> > > > > > > - if (err) {
> > > > > > > - if (err == -EOPNOTSUPP) {
> > > > > > > - src_fsc->have_copy_from2 = false;
> > > > > > > - pr_notice("OSDs don't support 'copy-from2'; "
> > > > > > > - "disabling copy_file_range\n");
> > > > > > > - }
> > > > > > > + if (IS_ERR(req)) {
> > > > > > > + err = PTR_ERR(req);
> > > > > > > dout("ceph_osdc_copy_from returned %d\n", err);
> > > > > > > +
> > > > > > > + /* wait for all queued requests */
> > > > > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete);
> > > > > > > + ret += reqs_complete * object_size; /* Update copied bytes */
> > > > > >
> > > > > > Hi Luis,
> > > > > >
> > > > > > Looks like ret is still incremented unconditionally? What happens
> > > > > > if there are three OSD requests on the list and the first fails but
> > > > > > the second and the third succeed? As is, ceph_osdc_wait_requests()
> > > > > > will return an error with reqs_complete set to 2...
> > > > > >
> > > > > > > if (!ret)
> > > > > > > ret = err;
> > > > > >
> > > > > > ... and we will return 8M instead of an error.
> > > > >
> > > > > Right, my assumption was that if a request fails, all subsequent requests
> > > > > would also fail. This would allow ret to be updated with the number of
> > > > > successful requests (x object size), even if the OSDs replies were being
> > > > > delivered in a different order. But from your comment I see that my
> > > > > assumption is incorrect.
> > > > >
> > > > > In that case, when shall ret be updated with the number of bytes already
> > > > > written? Only after a successful call to ceph_osdc_wait_requests()?
> > > >
> > > > I mentioned this in the previous email: you probably want to change
> > > > ceph_osdc_wait_requests() so that the counter isn't incremented after
> > > > an error is encountered.
> > >
> > > Sure, I've seen that comment. But it doesn't help either because it's not
> > > guaranteed that we'll receive the replies from the OSDs in the same order
> > > we've sent them. Stopping the counter when we get an error doesn't
> > > provide us any reliable information (which means I can simply drop that
> > > counter).
> >
> > The list is FIFO so even though replies may indeed arrive out of
> > order, ceph_osdc_wait_requests() will process them in order. If you
> > stop counting as soon as an error is encountered, you know for sure
> > that requests 1 through $COUNTER were successful and can safely
> > multiply it by object size.
> >
> > > > > I.e. only after each throttling cycle, when we don't have any requests
> > > > > pending completion? In this case, I can simply drop the extra
> > > > > reqs_complete parameter to the ceph_osdc_wait_requests.
> > > > >
> > > > > In your example the right thing to do would be to simply return an error,
> > > > > I guess. But then we're assuming that we're loosing space in the storage,
> > > > > as we may have created objects that won't be reachable anymore.
> > > >
> > > > Well, that is what I'm getting at -- this needs a lot more
> > > > consideration. How errors are dealt with, how file metadata is
> > > > updated, when do we fall back to plain copy, etc. Generating stray
> > > > objects is bad but way better than reporting that e.g. 0..12M were
> > > > copied when only 0..4M and 8M..12M were actually copied, leaving
> > > > the user one step away from data loss. One option is to revert to
> > > > issuing copy-from requests serially when an error is encountered.
> > > > Another option is to fall back to plain copy on any error. Or perhaps
> > > > we just don't bother with the complexity of parallel copy-from requests
> > > > at all...
> > >
> > > To be honest, I'm starting to lean towards this option. Reverting to
> > > serializing requests or to plain copy on error will not necessarily
> > > prevent the stray objects:
> > >
> > > - send a bunch of copy requests
> > > - wait for them to complete
> > > * 1 failed, the other 63 succeeded
> > > - revert to serialized copies, repeating the previous 64 requests
> > > * after a few copies, we get another failure (maybe on the same OSDs)
> > > and abort, leaving behind some stray objects from the previous bulk
> > > request
> >
> > Yeah, doing it serially makes the accounting a lot easier. If you
> > issue any OSD requests before updating the size, stray objects are
> > bound to happen -- that's why "how file metadata is updated" is one
> > of the important considerations.
>
> I don't think this is fundamentally different from the direct write
> codepath. It's just that the source of the write is another OSD rather
> than being sent in the request.
>
> If you look at ceph_direct_read_write(), then it does this:
>
> /*
> * To simplify error handling, allow AIO when IO within i_size
> * or IO can be satisfied by single OSD request.
> */
>
> So, that's another possibility. Do the requests synchronously when they
> will result in a change to i_size. Otherwise, you can do them in
> parallel.

This could also work, but would add even more complexity. I already find
the __copy_file_range implementation way to complex (and the fact that I
keep adding bugs to it is a good indicator :-) ), although I know the
performance improvements provided by the parallel requests can be huge.
Anyway, let me (try to!) fix these other bugs that Ilya keeps finding and
then I'll revisit the whole thing.

Cheers,
--
Lu?s

> In practice, we'd have to recommend that people truncate the destination
> file up to the final length before doing copy_file_range, but that
> doesn't sound too onerous.
>
> --
> Jeff Layton <[email protected]>
>