LinuxLists.cc - Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

2019-09-07 09:23:54

Subject: Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

"Jeff Layton" <[email protected]> writes:

> On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> OSDs are able to perform object copies across different pools. Thus,
>> there's no need to prevent copy_file_range from doing remote copies if the
>> source and destination superblocks are different. Only return -EXDEV if
>> they have different fsid (the cluster ID).
>>
>> Signed-off-by: Luis Henriques <[email protected]>
>> ---
>> fs/ceph/file.c | 23 +++++++++++++++++++----
>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>
>> Hi!
>>
>> I've finally managed to run some tests using multiple filesystems, both
>> within a single cluster and also using two different clusters. The
>> behaviour of copy_file_range (with this patch, of course) was what I
>> expected:
>>
>> - Object copies work fine across different filesystems within the same
>> cluster (even with pools in different PGs);
>> - -EXDEV is returned if the fsid is different
>>
>> (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>> Because this is actually what's in ceph.conf fsid in "[global]"
>> section. Anyway...)
>>
>> So, what's missing right now is (I always mention this when I have the
>> opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> And add the corresponding support for the new flag to the kernel
>> client, of course.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 685a03cc4b77..88d116893c2b 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>>  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>>  	struct ceph_cap_flush *prealloc_cf;
>> +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>>  	struct ceph_object_locator src_oloc, dst_oloc;
>>  	struct ceph_object_id src_oid, dst_oid;
>>  	loff_t endoff = 0, size;
>> @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>
>>  	if (src_inode == dst_inode)
>>  		return -EINVAL;
>> -	if (src_inode->i_sb != dst_inode->i_sb)
>> -		return -EXDEV;
>> +	if (src_inode->i_sb != dst_inode->i_sb) {
>> +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> +
>> +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> +			dout("No fsid in a fs client\n");
>> +			return -EXDEV;
>> +		}
>
> In what situation is there no fsid? Old cluster version?
>
> If there is no fsid, can we take that to indicate that there is only a
> single filesystem possible in the cluster and that we should attempt the
> copy anyway?

TBH I'm not sure if 'have_fsid' can ever be 'false' in this call. It is
set to 'true' when handling the monmap, and it's never changed back to
'false'. Since I don't think copy_file_range will be invoked *before*
we get the monmap, it should be safe to drop this check. Maybe it could
be replaced by a WARN_ON()?
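
To make the suggestion concrete, here is a minimal userspace model of the
check with the have_fsid test reduced to an assertion. It is only a sketch:
model_fsid, model_client and cross_fs_copy_check() are hypothetical stand-ins
rather than kernel definitions, memcmp() stands in for ceph_fsid_compare(),
and assert() stands in for WARN_ON() (unlike WARN_ON(), assert() aborts).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte cluster ID. */
struct model_fsid {
	unsigned char fsid[16];
};

/* Hypothetical stand-in for the per-mount client state. */
struct model_client {
	struct model_fsid fsid;
	bool have_fsid;		/* set once a monmap has been handled */
};

/* Returns 0 if both mounts belong to the same cluster, -EXDEV otherwise. */
int cross_fs_copy_check(const struct model_client *src,
			const struct model_client *dst)
{
	/*
	 * Assumption from the discussion: a copy is never attempted before
	 * both clients have seen a monmap, so this should never fire.
	 */
	assert(src->have_fsid && dst->have_fsid);

	if (memcmp(&src->fsid, &dst->fsid, sizeof(src->fsid)) != 0)
		return -EXDEV;
	return 0;
}
```

In this model, two mounts of the same cluster pass the check and anything
else gets -EXDEV, which is the behaviour the patch description asks for.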

Cheers,
--
Luis

>
>> +		if (ceph_fsid_compare(&src_fsc->client->fsid,
>> +				      &dst_fsc->client->fsid)) {
>> +			dout("Copying object across different clusters:");
>> +			dout(" src fsid: %*ph\n dst fsid: %*ph\n",
>> +			     16, &src_fsc->client->fsid,
>> +			     16, &dst_fsc->client->fsid);
>> +			return -EXDEV;
>> +		}
>> +	}
>>  	if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>>  		return -EROFS;
>>
>> @@ -1928,7 +1943,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  	 * efficient).
>>  	 */
>>
>> -	if (ceph_test_mount_opt(ceph_inode_to_client(src_inode), NOCOPYFROM))
>> +	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
>>  		return -EOPNOTSUPP;
>>
>>  	if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> @@ -2044,7 +2059,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>>  				dst_ci->i_vino.ino, dst_objnum);
>>  		/* Do an object remote copy */
>>  		err = ceph_osdc_copy_from(
>> -			&ceph_inode_to_client(src_inode)->client->osdc,
>> +			&src_fsc->client->osdc,
>>  			src_ci->i_vino.snap, 0,
>>  			&src_oid, &src_oloc,
>>  			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
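
From the caller's side, the -EXDEV and -EOPNOTSUPP returns in this patch mean
that an application should be ready to copy the data by hand. The following is
a hedged userspace sketch of that pattern, not part of the patch. The names
should_fall_back(), write_all() and copy_range_or_fallback() are made up for
illustration, the list of errno values treated as "fall back" is an
assumption, and the raw syscall() is used only so the sketch builds without
feature-test macros (the copy_file_range() wrapper from libc would normally
be called instead).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed set of errors after which a plain read/write copy is worth trying. */
int should_fall_back(int err)
{
	return err == EXDEV || err == EOPNOTSUPP ||
	       err == ENOSYS || err == EINVAL;
}

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t w = write(fd, buf, len);

		if (w < 0)
			return -1;
		buf += w;
		len -= (size_t)w;
	}
	return 0;
}

/*
 * Copy up to len bytes from fd_in to fd_out at their current offsets.
 * Tries copy_file_range first and switches to read()/write() if the kernel
 * refuses. Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t copy_range_or_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;
	int use_fallback = 0;

	while (done < len) {
		size_t want = len - done;
		ssize_t n;

		if (!use_fallback) {
			n = (ssize_t)syscall(SYS_copy_file_range, fd_in, NULL,
					     fd_out, NULL, want, 0U);
			if (n < 0 && should_fall_back(errno)) {
				use_fallback = 1;
				continue;
			}
		} else {
			if (want > sizeof(buf))
				want = sizeof(buf);
			n = read(fd_in, buf, want);
			if (n > 0 && write_all(fd_out, buf, (size_t)n) < 0)
				return -1;
		}
		if (n < 0)
			return -1;
		if (n == 0)
			break;	/* end of input */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

With the patch applied, a copy between two CephFS mounts of the same cluster
would be expected to stay on the first path, while a copy between clusters
would take the read/write path after -EXDEV.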

2019-09-08 19:34:52

by Jeff Layton

Subject: Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
> "Jeff Layton" <[email protected]> writes:
>
> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
> > > OSDs are able to perform object copies across different pools. Thus,
> > > there's no need to prevent copy_file_range from doing remote copies if the
> > > source and destination superblocks are different. Only return -EXDEV if
> > > they have different fsid (the cluster ID).
> > >
> > > Signed-off-by: Luis Henriques <[email protected]>
> > > ---
> > > fs/ceph/file.c | 23 +++++++++++++++++++----
> > > 1 file changed, 19 insertions(+), 4 deletions(-)
> > >
> > > Hi!
> > >
> > > I've finally managed to run some tests using multiple filesystems, both
> > > within a single cluster and also using two different clusters. The
> > > behaviour of copy_file_range (with this patch, of course) was what I
> > > expected:
> > >
> > > - Object copies work fine across different filesystems within the same
> > > cluster (even with pools in different PGs);
> > > - -EXDEV is returned if the fsid is different
> > >
> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
> > > Because this is actually what's in ceph.conf fsid in "[global]"
> > > section. Anyway...)
> > >
> > > So, what's missing right now is (I always mention this when I have the
> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
> > > And add the corresponding support for the new flag to the kernel
> > > client, of course.
> > >
> > > Cheers,
> > > --
> > > Luis
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 685a03cc4b77..88d116893c2b 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> > >  	struct ceph_cap_flush *prealloc_cf;
> > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > >  	struct ceph_object_locator src_oloc, dst_oloc;
> > >  	struct ceph_object_id src_oid, dst_oid;
> > >  	loff_t endoff = 0, size;
> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >
> > >  	if (src_inode == dst_inode)
> > >  		return -EINVAL;
> > > -	if (src_inode->i_sb != dst_inode->i_sb)
> > > -		return -EXDEV;
> > > +	if (src_inode->i_sb != dst_inode->i_sb) {
> > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > +
> > > +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
> > > +			dout("No fsid in a fs client\n");
> > > +			return -EXDEV;
> > > +		}
> >
> > In what situation is there no fsid? Old cluster version?
> >
> > If there is no fsid, can we take that to indicate that there is only a
> > single filesystem possible in the cluster and that we should attempt the
> > copy anyway?
>
> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call. It is
> set to 'true' when handling the monmap, and it's never changed back to
> 'false'. Since I don't think copy_file_range will be invoked *before*
> we get the monmap, it should be safe to drop this check. Maybe it could
> be replaced by a WARN_ON()?
>

Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
in ceph_check_fsid when the client is initially created. Maybe there is
some better way to achieve that?

In any case, I'd just drop that condition here.
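
For readers following along, the role of have_fsid described above can be
modelled in userspace roughly as follows. This is a simplified sketch written
from the description in this thread, not a copy of ceph_check_fsid(): the
struct and function names are hypothetical, fprintf() stands in for pr_err(),
and the model sets have_fsid inside the check for brevity, whereas in the
kernel it is set when the monmap is handled.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte cluster ID. */
struct model_fsid {
	unsigned char fsid[16];
};

/* Hypothetical stand-in for the client state. */
struct model_client {
	struct model_fsid fsid;
	bool have_fsid;
};

/*
 * The first fsid seen is recorded silently; every later one must match.
 * Returns 0 on success and -1 on a mismatch.
 */
int model_check_fsid(struct model_client *client,
		     const struct model_fsid *fsid)
{
	if (client->have_fsid) {
		if (memcmp(&client->fsid, fsid, sizeof(*fsid)) != 0) {
			fprintf(stderr, "bad fsid\n");
			return -1;
		}
		return 0;
	}

	/* Freshly created client: nothing to compare against, so no error. */
	client->fsid = *fsid;
	client->have_fsid = true;
	return 0;
}
```

The flag only matters for the very first call; once it is set, the function
is a plain comparison.
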
--
Jeff Layton <[email protected]>

2019-09-10 04:33:49

by Luis Henriques

Subject: Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

"Jeff Layton" <[email protected]> writes:

> On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
>> "Jeff Layton" <[email protected]> writes:
>>
>> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
>> > > OSDs are able to perform object copies across different pools. Thus,
>> > > there's no need to prevent copy_file_range from doing remote copies if the
>> > > source and destination superblocks are different. Only return -EXDEV if
>> > > they have different fsid (the cluster ID).
>> > >
>> > > Signed-off-by: Luis Henriques <[email protected]>
>> > > ---
>> > > fs/ceph/file.c | 23 +++++++++++++++++++----
>> > > 1 file changed, 19 insertions(+), 4 deletions(-)
>> > >
>> > > Hi!
>> > >
>> > > I've finally managed to run some tests using multiple filesystems, both
>> > > within a single cluster and also using two different clusters. The
>> > > behaviour of copy_file_range (with this patch, of course) was what I
>> > > expected:
>> > >
>> > > - Object copies work fine across different filesystems within the same
>> > > cluster (even with pools in different PGs);
>> > > - -EXDEV is returned if the fsid is different
>> > >
>> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
>> > > Because this is actually what's in ceph.conf fsid in "[global]"
>> > > section. Anyway...)
>> > >
>> > > So, what's missing right now is (I always mention this when I have the
>> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
>> > > And add the corresponding support for the new flag to the kernel
>> > > client, of course.
>> > >
>> > > Cheers,
>> > > --
>> > > Luis
>> > >
>> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > > index 685a03cc4b77..88d116893c2b 100644
>> > > --- a/fs/ceph/file.c
>> > > +++ b/fs/ceph/file.c
>> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> > >  	struct ceph_cap_flush *prealloc_cf;
>> > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
>> > >  	struct ceph_object_locator src_oloc, dst_oloc;
>> > >  	struct ceph_object_id src_oid, dst_oid;
>> > >  	loff_t endoff = 0, size;
>> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> > >
>> > >  	if (src_inode == dst_inode)
>> > >  		return -EINVAL;
>> > > -	if (src_inode->i_sb != dst_inode->i_sb)
>> > > -		return -EXDEV;
>> > > +	if (src_inode->i_sb != dst_inode->i_sb) {
>> > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
>> > > +
>> > > +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
>> > > +			dout("No fsid in a fs client\n");
>> > > +			return -EXDEV;
>> > > +		}
>> >
>> > In what situation is there no fsid? Old cluster version?
>> >
>> > If there is no fsid, can we take that to indicate that there is only a
>> > single filesystem possible in the cluster and that we should attempt the
>> > copy anyway?
>>
>> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call. It is
>> set to 'true' when handling the monmap, and it's never changed back to
>> 'false'. Since I don't think copy_file_range will be invoked *before*
>> we get the monmap, it should be safe to drop this check. Maybe it could
>> be replaced by a WARN_ON()?
>>
>
> Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
> in ceph_check_fsid when the client is initially created. Maybe there is
> some better way to achieve that?

I guess the struct ceph_fsid embedded in the client(s) could be changed
into a pointer initialized to NULL (and later dynamically allocated).
Then, the have_fsid check could be replaced by a NULL check. Not sure
if it would bring any real benefit, though. Want me to give that a try?
Or maybe I misunderstood your question.
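
A rough userspace sketch of that idea, just to show what the NULL check would
look like. Everything here is hypothetical: the struct and function names are
invented for illustration, and malloc()/free() stand in for whatever kernel
allocator would actually be used.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte cluster ID. */
struct model_fsid {
	unsigned char fsid[16];
};

/* Client that holds a pointer instead of an embedded fsid plus a flag. */
struct model_client {
	struct model_fsid *fsid;	/* NULL until the first monmap is seen */
};

/*
 * Record the first fsid, verify any later one.
 * Returns 0 on success, -1 on a mismatch, -ENOMEM if allocation fails.
 */
int model_check_fsid_ptr(struct model_client *client,
			 const struct model_fsid *fsid)
{
	if (client->fsid)
		return memcmp(client->fsid, fsid, sizeof(*fsid)) ? -1 : 0;

	client->fsid = malloc(sizeof(*fsid));
	if (!client->fsid)
		return -ENOMEM;
	*client->fsid = *fsid;
	return 0;
}

/* The pointer variant needs an explicit release path. */
void model_client_release(struct model_client *client)
{
	free(client->fsid);
	client->fsid = NULL;
}
```

The NULL check replaces the flag, at the cost of an extra allocation, a new
failure mode and a release path, which is why the benefit looks doubtful.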

> In any case, I'd just drop that condition here.

Ok, I'll send v2 in a second, without this check.

[ BTW, it looks like my initial post didn't make it to vger.kernel.org.
It was probably dropped because I screwed up the 'To:' field in my
email (no idea how I did that, TBH). ]

Cheers,
--
Luis