2012-07-25 07:32:15

by Peng Tao

[permalink] [raw]
Subject: pnfs LD partial sector write

Hi Boaz,

Sorry about the long delay; I was interrupted by some internal work. Now
I'm looking at the partial LD write problem again. Instead of trying to
blindly bail out on unaligned writes, this time I want to fix the write
code to handle partial writes as you suggested before. However, it
seems to be more problematic than I thought.

The dirty range of a page passed to LD->write_pagelist may be
unaligned to the sector size, in which case the block layer cannot
handle it correctly. Even worse, I cannot do a read-modify-write cycle
within the same page, because the bio would read in the entire sector
and thus ruin the user data within that sector. Currently I'm thinking
of creating shadow pages for partial sector writes and using them to
read in the sector and copy the necessary data into the user pages.
But it is way too tricky and I don't like it at all. So I want to ask
how you solve the partial sector write problem in the object layout
driver.

I looked at the ore code and found that you are using bios to deal
with partial page reads/writes as well. But in places like
_add_to_r4w(), I don't see how partial sectors are handled. Maybe I
was misreading the code. Would you please shed some light? More
specifically, how does the object layout driver handle partial sector
writes like in the simple testcase below? Thanks in advance.

--
Best,
Tao


flock-partial-write.c:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        int fd, i, offset = 666, len = 777;
        char buf[4096], buf_v[4096];
        struct flock lock;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s [filename]\n", argv[0]);
                return -1;
        }

        /* write one full page of 'A's */
        memset(buf, 'A', sizeof(buf));

        if ((fd = open(argv[1], O_CREAT|O_RDWR, 0644)) < 0) {
                perror("open fail");
                return -1;
        }

        if (write(fd, buf, sizeof(buf)) < (ssize_t)sizeof(buf)) {
                perror("write fail");
                return -1;
        }

        close(fd);

        system("echo 1 > /proc/sys/vm/drop_caches");

        /* dirty a sector-unaligned range with 'B's; keep a copy to verify */
        memset(buf + offset, 'B', len);
        memcpy(buf_v, buf, sizeof(buf_v));

        if ((fd = open(argv[1], O_WRONLY)) < 0) {
                perror("open fail");
                return -1;
        }

        memset(&lock, 0, sizeof(lock));
        lock.l_type = F_WRLCK;
        lock.l_whence = SEEK_SET;
        lock.l_start = offset;
        lock.l_len = len;

        if (fcntl(fd, F_SETLKW, &lock) < 0) {
                perror("lock fail");
                return -1;
        }

        if (lseek(fd, offset, SEEK_SET) < 0) {
                perror("seek fail");
                return -1;
        }

        if (write(fd, buf + offset, len) < len) {
                perror("write fail");
                return -1;
        }

        lock.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lock);

        close(fd);

        /* read the page back and compare against the expected contents */
        if ((fd = open(argv[1], O_RDONLY)) < 0) {
                perror("open fail");
                return -1;
        }

        if (read(fd, buf, sizeof(buf)) < (ssize_t)sizeof(buf)) {
                perror("read fail");
                return -1;
        }

        if (memcmp(buf, buf_v, sizeof(buf)) != 0) {
                fprintf(stderr, "aha, buf not match\n");
                for (i = 0; i < (int)sizeof(buf); i++) {
                        if (buf[i] != buf_v[i])
                                fprintf(stderr, "%dth %c vs %c\n", i, buf[i], buf_v[i]);
                }
        } else {
                printf("nice done!\n");
        }

        close(fd);
        return 0;
}


2012-07-26 07:29:48

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 05:43 AM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 4:29 AM, Boaz Harrosh <[email protected]> wrote:
>> On 07/25/2012 05:43 PM, Peng Tao wrote:
>>
>>> On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <[email protected]> wrote:
>>>> On 07/25/2012 10:31 AM, Peng Tao wrote:
>>>>
>>>>> Hi Boaz,
>>>>>
>>>>> Sorry about the long delay. I had some internal interrupt. Now I'm
>>>>> looking at the partial LD write problem again. Instead of trying to
>>>>> bail out unaligned writes blindly, this time I want to fix the write
>>>>> code to handle partial write as you suggested before. However, it
>>>>> seems to be more problematic than I used to think.
>>>>>
>>>>> The dirty range of a page passed to LD->write_pagelist may be
>>>>> unaligned to sector size, in which case block layer cannot handle it
>>>>> correctly. Even worse, I cannot do a read-modify-write cycle within
>>>>> the same page because bio would read in the entire sector and thus
>>>>> ruin user data within the same sector. Currently I'm thinking of
>>>>> creating shadow pages for partial sector write and use them to read in
>>>>> the sector and copy necessary data into user pages. But it is way too
>>>>> tricky and I don't feel like it at all. So I want to ask how you solve
>>>>> the partial sector write problem in object layout driver.
>>>>>
>>>>> I looked at the ore code and found that you are using bio to deal with
>>>>> partial page read/write as well. But in places like _add_to_r4w(), I
>>>>> don't see how partial sectors are handled. Maybe I was misreading the
>>>>> code. Would you please shed some light? More specifically, how does
>>>>> object layout driver handle partial sector writers like in bellow
>>>>> simple testcase? Thanks in advance.
>>>>>
>>>>
>>>>
>>>> The objlayout does not have this problem. OSD-SCSI is a byte aligned
>>>> protocol, unlike DISK-SCSI.
>>>>
>>> aha, I see. So this is blocklayout only problem.
>>>
>>>> The code you are looking for is at _add_to_r4w_first_page() &&
>>>> _add_to_r4w_last_page. But as I said I just submit a read of:
>>>> 0 => offset within the page
>>>> What ever that might be.
>>>>
>>>> In your case: why? all you have to do is allocate 2 sectors (1k) at
>>>> most one for partial sector at end and one for partial sector at
>>>> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>>>>
>>>> What you do is chain a single-sector BIO to an all aligned BIO
>>>>
>>> Yeah, it is exactly what I mean by "shadow pages" except for the
>>> chained BIO part. I said "shadow pages" because I need to create one
>>> or two pages to construct bio_vec to do the full sector sync read, and
>>> the pages cannot be attached to inode address space (that's why
>>> "shadow" :-).
>>>
>>> I asked because I don't like the solution and thought maybe there is
>>> better method in object layout and I didn't find it in object code.
>>> Now that it is a blocklayout only problem, I guess I'll have to do the
>>> full sector sync reads tricks.
>>>
>>>> You do the following:
>>>>
>>>> - You will need to preform two reads, right? One for the unaligned
>>>> BLOCK at the begging and one for the BLOCK at the end. Since in
>>>> blocklayout all IO is BLOCK aligned.
>>>>
>>>> Beginning end of IO
>>>> - Jump over first unaligned SECTOR. Prepare BIO from first full
>>>> sector, to the end of the BLOCK.
>>>> - Prepare a 1-biovec BIO from the above allocated sector, which
>>>> reads the full first sector.
>>>> - perpend the 1-vec BIO to the big one.
>>>> - preform the read
>>>> - memcpy from above allocated sector the 0=>offset part into the
>>>> NFS original page.
>>>>
>>>> Do the same for end of IO but for the very last unaligned sector.
>>>> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
>>>> part.
>>>>
>>>> So you see no shadow pages and not so complicated. In the unaligned
>>>> case at most you need allocate 1k and chain BIOs at beginning and/or
>>>> at end.
>>>>
>>>> Tell me if you need help with BIO chaining. The 1-vec BIO just use
>>>> bio_kmalloc().
>>>>
>>> yeah, I do have a question on the BIO chaining thing. IMO, I need to
>>> do one or two sync full sector reads, and memcpy the data in the pages
>>> to fill original NFS page into sector aligned. And then I can issue
>>> the sector aligned writes to write out all nfs pages. So I don't quite
>>> get it when you say "perpend the 1-vec BIO to the big one", because
>>> the sector aligned writes (the big one) must be submitted _after_ the
>>> full sector sync reads and memcpy. Would you explain it a bit?
>>>
>>
>>
>> I'm not sure if that is what you meant but I thought you need to write
>> as part of the original IO also the reminder of the last and fist BLOCKs
>>
>> BLOCK means the unit set by the MDS as the atomic IO operation of any
>> IO. If not a full BLOCK is written then the read-layout needs to be used
>> to copy the un written parts of the BLOCK into the write layout.
>>
> Not sure about objectlayout, but for block layout, we really don't
> have to always read/write in BLOCK size. BLOCK is just a minimal
> traceable extent and it is all about extent state that we care about.
> If it is a read-write extent (which is the common case for rewrite),
> blocklayout client can do whatever size of IO as long as the
> underlying hardware supports (in DISK-SCSI case, SECTOR size).
>
>> And that BLOCK can be bigger then a page (multiple of pages) and therefore
>> also bigger then a sector (512 bytes).
>>
>> [In objects layout RFC the stripe_unit is not mandatory a multiple of
>> PAGE_SIZE, but if it is not so, we return error at alloc_lseg and use
>> MDS. I hope it is the same for blocklayout. BLOCK if bigger then
>> PAGE_SIZE should be multiple of. If not revert to MDS IO]
>>
>> So this is what I see. Say BLOCK is two pages.
>>
>> The complete IO will look like:
>>
>> .....| block 0 || block 1 || block 2 || block 3 |......
>> .....|page 0|page 1||page 2|page 3||page 4|page 5||page 6|page 7|......
>> ^ ^ ^ ^
>> | |<--------------------------------->| |
>> | NFS-IO-start NFS-IO-end |
>> | | | |
>> | | | |
>> |<-read I->| |<-read II->|
>> |<-------------------------------------------------------->|
>> Written-IO-start Written-IO-end
>>
>> Note that the best is if all these pages above, before the write
>> operation, are at page-cache if not it is asking for trouble.
>>
>> lets zoom into the first block. (The same at last block but opposite)
>>
>> .....| block 0 |......
>> .....| page 0 | page 1 |......
>> .....| sec0 | sec1 | sec2 | sec3 | sec4 | sec5 | sec6 | sec7 |......
>> ^ ^
>> | |----------......
>> | NFS-IO-start
>> |<----------------read I--------------------->|
>> |<----------------BIO_A------------------>| |
>> |<->| <---- memcpy-part
>> BIO_B---> |<--->|
>>
>> (Sorry I put 4 sectors per page, it is 8, but the principle is the same)
> Thanks a lot for the graph, it is very impressive and helps me a lot
> in understanding your idea.
>
>>
>> You can not submit an IO read as one BIO into the original cache pages
>> because sec6 above will be needed to be read complete and this will
>> over-write the good part of sec6 which has valid data.
>>
>> So you make one BIO_A with sec0-5 which point to original page-cache pages.
>> You make a second BIO_B which points to a side buffer of a the full sec6, and
>> you chain them. ie:
>> BIO_A->bi_next = BIO_B (This is what I mean post-pend)
>>
> As I explained above, block layout client doesn't have to read sec0-5,
> if extent is already read-write. Just when extent is invalid and if
> there is a copy-on-write extent, client need to read in data from the
> cow extent. And the BIO chaining thing is really unnecessary IMHO. In
> cases client need to read in from cow extent, I can just use a single
> BIO to read in sec0-6 and memcpy sec4-5 and part of sec6 into the
> original nfs page.
>
> It's not complicated. I have already cooked the patch. Will send it
> out later today after more testing. It's just that I don't like the
> solution, because I'll have to allocate more pages to construct
> bio_vec to do read. It is an extra effort especially in memory reclaim
> writeback case. Maybe I should make sure single page writeback don't
> go through block layout LD.
>
>> - Now submit the one read, two BIOs chained.
>> - Do the same for the NFS-IO-end above, also one read 2 BIOs chained
>>
>> - Wait for both reads to return
>>
>> - Then you memcpy sec6 0 to offset%sector_size into original cache page.
>> - Same for the end part, last_byte to sector_end
>>
>> - Now you can submit the full write.
>>
>> Both page 0 and page 1 can be marked as uptodate. But most important
>> page 0 was not in cache before the preparation of the read, it must
>> be marked as PageUptodate().
>>
> Another thing is, this further complicates direct writes, where I
> cannot use pagecache to ensure proper locking for concurrent writers
> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
> be serialized internally. IOW, the same code cannot be reused by DIO
> writes. sigh...
>


Crap, you did not understand my idea. In my plan all IO is done on
page-cache pages and/or NFS pages, *ALL*. Even with the sec6 case
above, page 1 is directly IOed and locked normally; only the single
sector 6 is not.

You go ahead and say, "yes, I have a solution just like yours that
allocates multiple pages and IOs and copies", "but I don't like the
allocations ...."

But this is exactly the opposite of my plan. In my plan you only allocate
*at most* 2 sectors. If you are concerned about memory pressure, just make
a mempool of 512-byte units, keep 2 spare, and you are done. (That's how
SCSI works.)
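
A rough, untested sketch of that mempool idea (the names are made up,
just to show how little memory is involved):

/* Untested sketch: a pool of 512-byte side buffers, with 2 elements kept
 * in reserve for the worst case (one head sector + one tail sector). */
#include <linux/mempool.h>
#include <linux/errno.h>

static mempool_t *bl_sector_pool;

static int bl_sector_pool_init(void)
{
        bl_sector_pool = mempool_create_kmalloc_pool(2, 512);
        return bl_sector_pool ? 0 : -ENOMEM;
}

/* Per unaligned IO: mempool_alloc(bl_sector_pool, GFP_NOFS) for the side
 * buffer, mempool_free() once the read-modify-write completes. */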

My demonstration above was for the worst case, where "the extent is
invalid and there is a copy-on-write extent".

Of course, when you don't need that, all you need is the single sector
read of the BIO_B above and the copy.

All during the IO, all pages are locked as before, specifically page 1
above which holds sec6.

I will not continue with these explanations, because clearly you are not listening
to me, and you have your own code in mind, so what is the use?

Good luck
Boaz

2012-07-25 10:28:24

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/25/2012 10:31 AM, Peng Tao wrote:

> Hi Boaz,
>
> Sorry about the long delay. I had some internal interrupt. Now I'm
> looking at the partial LD write problem again. Instead of trying to
> bail out unaligned writes blindly, this time I want to fix the write
> code to handle partial write as you suggested before. However, it
> seems to be more problematic than I used to think.
>
> The dirty range of a page passed to LD->write_pagelist may be
> unaligned to sector size, in which case block layer cannot handle it
> correctly. Even worse, I cannot do a read-modify-write cycle within
> the same page because bio would read in the entire sector and thus
> ruin user data within the same sector. Currently I'm thinking of
> creating shadow pages for partial sector write and use them to read in
> the sector and copy necessary data into user pages. But it is way too
> tricky and I don't feel like it at all. So I want to ask how you solve
> the partial sector write problem in object layout driver.
>
> I looked at the ore code and found that you are using bio to deal with
> partial page read/write as well. But in places like _add_to_r4w(), I
> don't see how partial sectors are handled. Maybe I was misreading the
> code. Would you please shed some light? More specifically, how does
> object layout driver handle partial sector writers like in bellow
> simple testcase? Thanks in advance.
>


The objlayout does not have this problem. OSD-SCSI is a byte-aligned
protocol, unlike DISK-SCSI.

The code you are looking for is in _add_to_r4w_first_page() &&
_add_to_r4w_last_page(). But as I said, I just submit a read of:
0 => offset within the page
whatever that might be.

In your case: why? All you have to do is allocate 2 sectors (1k) at
most: one for the partial sector at the end and one for the partial
sector at the beginning. Then use chained BIOs and memcpy at most
[1k - 2] bytes.

What you do is chain a single-sector BIO to an all-aligned BIO.

You do the following:

- You will need to perform two reads, right? One for the unaligned
BLOCK at the beginning and one for the BLOCK at the end, since in
blocklayout all IO is BLOCK aligned.

At the beginning end of the IO:
- Jump over the first unaligned SECTOR. Prepare a BIO from the first
full sector to the end of the BLOCK.
- Prepare a 1-biovec BIO from the above allocated sector, which
reads the full first sector.
- Prepend the 1-vec BIO to the big one.
- Perform the read.
- memcpy the 0 => offset part from the above allocated sector into the
original NFS page.

Do the same for the end of the IO, but for the very last unaligned
sector. Chain the 1-vec BIO to the end this time, and memcpy the
last_byte => end-of-sector part.

So you see, no shadow pages and not so complicated. In the unaligned
case you need to allocate at most 1k and chain BIOs at the beginning
and/or the end.

Tell me if you need help with BIO chaining. For the 1-vec BIO just use
bio_kmalloc().
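
Roughly like this (an untested sketch only; the variable names and the
completion accounting are made up, and the bio fields follow the block
API of this era):

/* Untested sketch: a 1-vec BIO reading the full first sector into a side
 * page, linked in front of the big aligned BIO.  Both bios share the same
 * bi_end_io/bi_private so the caller can count completions. */
struct bio *small = bio_kmalloc(GFP_NOFS, 1);

small->bi_bdev = bdev;
small->bi_sector = first_unaligned_sector;
small->bi_end_io = end_io;
small->bi_private = par;
bio_add_page(small, side_page, 512, 0); /* side_page holds the spare sector */

small->bi_next = big_bio;               /* "prepend" small to the big BIO */

/* submit the whole chain as one logical read */
while (small) {
        struct bio *next = small->bi_next;

        small->bi_next = NULL;
        submit_bio(READ, small);
        small = next;
}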

Cheers
Boaz

2012-07-26 16:00:42

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 06:07 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 10:12 PM, Boaz Harrosh <[email protected]> wrote:
>> There is an easy locking solution for DIO which will not cost much
>> for DIO and will cost nothing for buffered IO. You use the page-cache
>> page lock.
>>
>> What you do is grab the zero-page of each block lock before/during writing to
>> any block. So for your example above they will all be serialized by page-zero
>> lock.
> Yeah, I agree this can work. But I'd prefer not to mix DIO with buffer
> IO, which is often error prone. If in any case I need to serialize
> AIODIO, I'd prefer to do it in easier ways like locking invalid
> extents etc, without messing with page cache.
>


Yes, just keep it BLOCK aligned and that's it. Apps will learn fast enough.
Simple is always better.

Currently I support any alignment, but I might do the same in objlayout
for the raid5/6 and DIO cases.

<>

> Or maybe somehow through statfs(2), since the blocksize attribute is
> actually a file system's attribute instead of block device's.
>


Good point!!
statfs->f_bsize

man statfs:
long f_bsize; /* optimal transfer block size */

What does NFS return in there now? Maybe let the LD override that?

I'll support such a patch; we could use it as well.
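
(For reference, a quick untested userspace check of what comes back in
f_bsize on a given mount:)

/* Untested: print statfs(2) f_bsize for the given path. */
#include <stdio.h>
#include <sys/vfs.h>

int main(int argc, char **argv)
{
        struct statfs st;

        if (argc != 2 || statfs(argv[1], &st) < 0) {
                perror("statfs");
                return 1;
        }
        printf("f_bsize = %ld\n", (long)st.f_bsize);
        return 0;
}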

Cheers
Boaz

2012-07-26 15:07:53

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 10:12 PM, Boaz Harrosh <[email protected]> wrote:
> There is an easy locking solution for DIO which will not cost much
> for DIO and will cost nothing for buffered IO. You use the page-cache
> page lock.
>
> What you do is grab the zero-page of each block lock before/during writing to
> any block. So for your example above they will all be serialized by page-zero
> lock.
Yeah, I agree this can work. But I'd prefer not to mix DIO with
buffered IO, which is often error prone. If I need to serialize AIO DIO
anyway, I'd prefer to do it in easier ways, like locking invalid
extents, without messing with the page cache.
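
(For reference, the page-zero-lock idea quoted above would look roughly
like the untested sketch below; the helper names are invented and the
pagecache calls follow the API of this era.)

/* Untested sketch: serialize writers to a BLOCK by locking the block's
 * first page-cache page.  bl_lock_block()/bl_unlock_block() are invented
 * names for illustration only. */
static struct page *bl_lock_block(struct inode *inode, loff_t pos,
                                  unsigned int blocksize)
{
        pgoff_t index = (pos & ~((loff_t)blocksize - 1)) >> PAGE_CACHE_SHIFT;

        /* returns the page locked, with a reference held */
        return find_or_create_page(inode->i_mapping, index, GFP_NOFS);
}

static void bl_unlock_block(struct page *page)
{
        unlock_page(page);
        page_cache_release(page);
}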

>
> Of course you need like before to flush the page-cache pages before DIO and
> invalidate all pages (NotUpToDate). You keep at least one page in page-cache
> per block, but during DIO it will always be in Not-Up-To-Date empty state.
>
> Then if needed, like example above the first time COW you still do through
> page-cache
>
> *
> * That said I think your solution for only allowing BLOCK aligned DIO is good
> * Applications should learn. They should however find out what BLOCK size is.
> *
>
> You could keep the proper info at the DM device you create for each device_id
> See here: http://people.redhat.com/msnitzer/docs/io-limits.txt
> The "logical_block_size" should be the proper BLOCK size above.
>
Or maybe somehow through statfs(2), since the blocksize attribute is
actually a file system attribute rather than the block device's.

--
Thanks,
Tao

2012-07-26 15:30:21

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 10:30 PM, Boaz Harrosh <[email protected]> wrote:
> On 07/26/2012 04:57 PM, Peng Tao wrote:
>
>> On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <[email protected]> wrote:
>>> On 07/26/2012 11:25 AM, Peng Tao wrote:
>>>
>>>> For these two sectors, I need to allocate two pages... Just look at
>>>> struct bio_vec.
>>>>
>>>
>>>
>>> NO! I know all about bio_vecs
>>>
>>> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
>>> did you get the "two pages" from?
>>>
>> What do you put int bio_vec->bv_page? Even if you just use 512 bytes
>> of a page, it is still allocated page.
>>
>
>
> No!!
>
> You just use bio_map_kern or in one go blk_rq_map_kern() with any: kmalloc,
> stack, or kernel pointer. And that's that. It will take what it will take.
>
First, I should admit I didn't know about bio_map_kern() and the like.
Thanks for teaching me about them (see, here is your credit :)

Looking at them, I don't think it is proper to use them in block
layout code, mainly because it is a layering violation. Searching for
callers of bio_map_kern()/blk_rq_map_kern(), they are either block core
code or device drivers. Clearly these interfaces are block layer
internal APIs and file systems shouldn't touch them.

Therefore I still think I should stick with the plain, common
bio_alloc/bio_add_page/submit_bio interfaces. Thanks for the
suggestion.


Cheers,
Tao

2012-07-25 10:45:55

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/25/2012 01:28 PM, Boaz Harrosh wrote:

> On 07/25/2012 10:31 AM, Peng Tao wrote:
>
>> Hi Boaz,
>>
>> Sorry about the long delay. I had some internal interrupt. Now I'm
>> looking at the partial LD write problem again. Instead of trying to
>> bail out unaligned writes blindly, this time I want to fix the write
>> code to handle partial write as you suggested before. However, it
>> seems to be more problematic than I used to think.
>>
>> The dirty range of a page passed to LD->write_pagelist may be
>> unaligned to sector size, in which case block layer cannot handle it
>> correctly. Even worse, I cannot do a read-modify-write cycle within
>> the same page because bio would read in the entire sector and thus
>> ruin user data within the same sector. Currently I'm thinking of
>> creating shadow pages for partial sector write and use them to read in
>> the sector and copy necessary data into user pages. But it is way too
>> tricky and I don't feel like it at all. So I want to ask how you solve
>> the partial sector write problem in object layout driver.
>>
>> I looked at the ore code and found that you are using bio to deal with
>> partial page read/write as well. But in places like _add_to_r4w(), I
>> don't see how partial sectors are handled. Maybe I was misreading the
>> code. Would you please shed some light? More specifically, how does
>> object layout driver handle partial sector writers like in bellow
>> simple testcase? Thanks in advance.
>>
>
>
> The objlayout does not have this problem. OSD-SCSI is a byte aligned
> protocol, unlike DISK-SCSI.
>
> The code you are looking for is at _add_to_r4w_first_page() &&
> _add_to_r4w_last_page. But as I said I just submit a read of:
> 0 => offset within the page
> What ever that might be.
>
> In your case: why? all you have to do is allocate 2 sectors (1k) at
> most one for partial sector at end and one for partial sector at
> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>
> What you do is chain a single-sector BIO to an all aligned BIO
>
> You do the following:
>
> - You will need to preform two reads, right? One for the unaligned
> BLOCK at the begging and one for the BLOCK at the end. Since in
> blocklayout all IO is BLOCK aligned.
>
> Beginning end of IO
> - Jump over first unaligned SECTOR. Prepare BIO from first full
> sector, to the end of the BLOCK.
> - Prepare a 1-biovec BIO from the above allocated sector, which
> reads the full first sector.
> - perpend the 1-vec BIO to the big one.
> - preform the read
> - memcpy from above allocated sector the 0=>offset part into the
> NFS original page.
>
> Do the same for end of IO but for the very last unaligned sector.
> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
> part.
>


Rrr, I got this all backwards. You need to read from the beginning of
the first BLOCK up to offset, and then from last_byte to the end of the
last BLOCK. So all I said above is exactly the opposite: post-pend the
chained bio for the first-BLOCK read, and pre-pend the chained bio for
the last-BLOCK read.

Cheers
Boaz

> So you see no shadow pages and not so complicated. In the unaligned
> case at most you need allocate 1k and chain BIOs at beginning and/or
> at end.
>
> Tell me if you need help with BIO chaining. The 1-vec BIO just use
> bio_kmalloc().
>
> Cheers
> Boaz



2012-07-26 13:57:35

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <[email protected]> wrote:
> On 07/26/2012 11:25 AM, Peng Tao wrote:
>
>> For these two sectors, I need to allocate two pages... Just look at
>> struct bio_vec.
>>
>
>
> NO! I know all about bio_vecs
>
> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
> did you get the "two pages" from?
>
What do you put in bio_vec->bv_page? Even if you only use 512 bytes
of a page, it is still an allocated page.

--
Thanks,
Tao

2012-07-26 08:25:56

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 3:29 PM, Boaz Harrosh <[email protected]> wrote:
> On 07/26/2012 05:43 AM, Peng Tao wrote:
>
>> On Thu, Jul 26, 2012 at 4:29 AM, Boaz Harrosh <[email protected]> wrote:
>>> On 07/25/2012 05:43 PM, Peng Tao wrote:
>>>
>>>> On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <[email protected]> wrote:
>>>>> On 07/25/2012 10:31 AM, Peng Tao wrote:
>>>>>
>>>>>> Hi Boaz,
>>>>>>
>>>>>> Sorry about the long delay. I had some internal interrupt. Now I'm
>>>>>> looking at the partial LD write problem again. Instead of trying to
>>>>>> bail out unaligned writes blindly, this time I want to fix the write
>>>>>> code to handle partial write as you suggested before. However, it
>>>>>> seems to be more problematic than I used to think.
>>>>>>
>>>>>> The dirty range of a page passed to LD->write_pagelist may be
>>>>>> unaligned to sector size, in which case block layer cannot handle it
>>>>>> correctly. Even worse, I cannot do a read-modify-write cycle within
>>>>>> the same page because bio would read in the entire sector and thus
>>>>>> ruin user data within the same sector. Currently I'm thinking of
>>>>>> creating shadow pages for partial sector write and use them to read in
>>>>>> the sector and copy necessary data into user pages. But it is way too
>>>>>> tricky and I don't feel like it at all. So I want to ask how you solve
>>>>>> the partial sector write problem in object layout driver.
>>>>>>
>>>>>> I looked at the ore code and found that you are using bio to deal with
>>>>>> partial page read/write as well. But in places like _add_to_r4w(), I
>>>>>> don't see how partial sectors are handled. Maybe I was misreading the
>>>>>> code. Would you please shed some light? More specifically, how does
>>>>>> object layout driver handle partial sector writers like in bellow
>>>>>> simple testcase? Thanks in advance.
>>>>>>
>>>>>
>>>>>
>>>>> The objlayout does not have this problem. OSD-SCSI is a byte aligned
>>>>> protocol, unlike DISK-SCSI.
>>>>>
>>>> aha, I see. So this is blocklayout only problem.
>>>>
>>>>> The code you are looking for is at _add_to_r4w_first_page() &&
>>>>> _add_to_r4w_last_page. But as I said I just submit a read of:
>>>>> 0 => offset within the page
>>>>> What ever that might be.
>>>>>
>>>>> In your case: why? all you have to do is allocate 2 sectors (1k) at
>>>>> most one for partial sector at end and one for partial sector at
>>>>> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>>>>>
>>>>> What you do is chain a single-sector BIO to an all aligned BIO
>>>>>
>>>> Yeah, it is exactly what I mean by "shadow pages" except for the
>>>> chained BIO part. I said "shadow pages" because I need to create one
>>>> or two pages to construct bio_vec to do the full sector sync read, and
>>>> the pages cannot be attached to inode address space (that's why
>>>> "shadow" :-).
>>>>
>>>> I asked because I don't like the solution and thought maybe there is
>>>> better method in object layout and I didn't find it in object code.
>>>> Now that it is a blocklayout only problem, I guess I'll have to do the
>>>> full sector sync reads tricks.
>>>>
>>>>> You do the following:
>>>>>
>>>>> - You will need to preform two reads, right? One for the unaligned
>>>>> BLOCK at the begging and one for the BLOCK at the end. Since in
>>>>> blocklayout all IO is BLOCK aligned.
>>>>>
>>>>> Beginning end of IO
>>>>> - Jump over first unaligned SECTOR. Prepare BIO from first full
>>>>> sector, to the end of the BLOCK.
>>>>> - Prepare a 1-biovec BIO from the above allocated sector, which
>>>>> reads the full first sector.
>>>>> - perpend the 1-vec BIO to the big one.
>>>>> - preform the read
>>>>> - memcpy from above allocated sector the 0=>offset part into the
>>>>> NFS original page.
>>>>>
>>>>> Do the same for end of IO but for the very last unaligned sector.
>>>>> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
>>>>> part.
>>>>>
>>>>> So you see no shadow pages and not so complicated. In the unaligned
>>>>> case at most you need allocate 1k and chain BIOs at beginning and/or
>>>>> at end.
>>>>>
>>>>> Tell me if you need help with BIO chaining. The 1-vec BIO just use
>>>>> bio_kmalloc().
>>>>>
>>>> yeah, I do have a question on the BIO chaining thing. IMO, I need to
>>>> do one or two sync full sector reads, and memcpy the data in the pages
>>>> to fill original NFS page into sector aligned. And then I can issue
>>>> the sector aligned writes to write out all nfs pages. So I don't quite
>>>> get it when you say "perpend the 1-vec BIO to the big one", because
>>>> the sector aligned writes (the big one) must be submitted _after_ the
>>>> full sector sync reads and memcpy. Would you explain it a bit?
>>>>
>>>
>>>
>>> I'm not sure if that is what you meant but I thought you need to write
>>> as part of the original IO also the reminder of the last and fist BLOCKs
>>>
>>> BLOCK means the unit set by the MDS as the atomic IO operation of any
>>> IO. If not a full BLOCK is written then the read-layout needs to be used
>>> to copy the un written parts of the BLOCK into the write layout.
>>>
>> Not sure about objectlayout, but for block layout, we really don't
>> have to always read/write in BLOCK size. BLOCK is just a minimal
>> traceable extent and it is all about extent state that we care about.
>> If it is a read-write extent (which is the common case for rewrite),
>> blocklayout client can do whatever size of IO as long as the
>> underlying hardware supports (in DISK-SCSI case, SECTOR size).
>>
>>> And that BLOCK can be bigger then a page (multiple of pages) and therefore
>>> also bigger then a sector (512 bytes).
>>>
>>> [In objects layout RFC the stripe_unit is not mandatory a multiple of
>>> PAGE_SIZE, but if it is not so, we return error at alloc_lseg and use
>>> MDS. I hope it is the same for blocklayout. BLOCK if bigger then
>>> PAGE_SIZE should be multiple of. If not revert to MDS IO]
>>>
>>> So this is what I see. Say BLOCK is two pages.
>>>
>>> The complete IO will look like:
>>>
>>> .....| block 0 || block 1 || block 2 || block 3 |......
>>> .....|page 0|page 1||page 2|page 3||page 4|page 5||page 6|page 7|......
>>> ^ ^ ^ ^
>>> | |<--------------------------------->| |
>>> | NFS-IO-start NFS-IO-end |
>>> | | | |
>>> | | | |
>>> |<-read I->| |<-read II->|
>>> |<-------------------------------------------------------->|
>>> Written-IO-start Written-IO-end
>>>
>>> Note that the best is if all these pages above, before the write
>>> operation, are at page-cache if not it is asking for trouble.
>>>
>>> lets zoom into the first block. (The same at last block but opposite)
>>>
>>> .....| block 0 |......
>>> .....| page 0 | page 1 |......
>>> .....| sec0 | sec1 | sec2 | sec3 | sec4 | sec5 | sec6 | sec7 |......
>>> ^ ^
>>> | |----------......
>>> | NFS-IO-start
>>> |<----------------read I--------------------->|
>>> |<----------------BIO_A------------------>| |
>>> |<->| <---- memcpy-part
>>> BIO_B---> |<--->|
>>>
>>> (Sorry I put 4 sectors per page, it is 8, but the principle is the same)
>> Thanks a lot for the graph, it is very impressive and helps me a lot
>> in understanding your idea.
>>
>>>
>>> You can not submit an IO read as one BIO into the original cache pages
>>> because sec6 above will be needed to be read complete and this will
>>> over-write the good part of sec6 which has valid data.
>>>
>>> So you make one BIO_A with sec0-5 which point to original page-cache pages.
>>> You make a second BIO_B which points to a side buffer of a the full sec6, and
>>> you chain them. ie:
>>> BIO_A->bi_next = BIO_B (This is what I mean post-pend)
>>>
>> As I explained above, block layout client doesn't have to read sec0-5,
>> if extent is already read-write. Just when extent is invalid and if
>> there is a copy-on-write extent, client need to read in data from the
>> cow extent. And the BIO chaining thing is really unnecessary IMHO. In
>> cases client need to read in from cow extent, I can just use a single
>> BIO to read in sec0-6 and memcpy sec4-5 and part of sec6 into the
>> original nfs page.
>>
>> It's not complicated. I have already cooked the patch. Will send it
>> out later today after more testing. It's just that I don't like the
>> solution, because I'll have to allocate more pages to construct
>> bio_vec to do read. It is an extra effort especially in memory reclaim
>> writeback case. Maybe I should make sure single page writeback don't
>> go through block layout LD.
>>
>>> - Now submit the one read, two BIOs chained.
>>> - Do the same for the NFS-IO-end above, also one read 2 BIOs chained
>>>
>>> - Wait for both reads to return
>>>
>>> - Then you memcpy sec6 0 to offset%sector_size into original cache page.
>>> - Same for the end part, last_byte to sector_end
>>>
>>> - Now you can submit the full write.
>>>
>>> Both page 0 and page 1 can be marked as uptodate. But most important
>>> page 0 was not in cache before the preparation of the read, it must
>>> be marked as PageUptodate().
>>>
>> Another thing is, this further complicates direct writes, where I
>> cannot use pagecache to ensure proper locking for concurrent writers
>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>> be serialized internally. IOW, the same code cannot be reused by DIO
>> writes. sigh...
>>
>
>
> Crap, you did not understand my idea. Because in my plan all IO is
> done on page-cache pages, and or NFS pages, *ALL*. Even with the sec6 case
> above, page 1 is directly IOed and locked normally. only the single sector6
> is not.
>
> You go head and say, "yes I have a solution just like you that allocates
> multiple pages and IOs and copies" , "But I don't like the allcations ...."
>
> But this is exactly the opposite of my plan. In my plan you only allocate
> *at most* 2 sector. If you are concern about mem pressure just make a mempool
> of 512 byte units, and have 2 spare and you are done. (That's how scsi works)
>
For these two sectors, I need to allocate two pages... Just look at
struct bio_vec.

--
Thanks,
Tao

2012-07-26 12:16:18

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 11:25 AM, Peng Tao wrote:

> For these two sectors, I need to allocate two pages... Just look at
> struct bio_vec.
>


NO! I know all about bio_vecs.

You need 1024 bytes, and 2 x one-entry BIOs, which is a few bytes. Where
did you get the "two pages" from?


2012-07-26 09:13:09

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 3:47 PM, Boaz Harrosh <[email protected]> wrote:
> On 07/26/2012 05:43 AM, Peng Tao wrote:
>
>> Another thing is, this further complicates direct writes, where I
>> cannot use pagecache to ensure proper locking for concurrent writers
>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>> be serialized internally. IOW, the same code cannot be reused by DIO
>> writes. sigh...
>>
>
>
> One last thing. Applications who use direct IO know to allocate
> and issue sector aligned requests both at offset and length.
> That's a Kernel requirement. It is not for NFS, but even so.
>
> Just refuse sector unaligned DIO and revert to MDS.
>
> With sector aligned IO you directly DIO to DIO pages,
> problem solved.
>
> If you need the COW of partial blocks, you still use
> page-cache pages, which is fine because they do not
> intersect any of the DIO.
>
I certainly thought about it, but it doesn't work for the AIO DIO case.
Assume the BLOCK size is 8K, process A writes bytes 0~4095 of file foo
with AIO DIO, and at the same time process B writes bytes 4096~8191
with AIO DIO. If the kernel ever tries to rely on the page cache to
cope with the invalid extent, it ends up with data corruption.

This is a common problem for any extent-based file system dealing with
partial BLOCK (_NOT SECTOR_) AIO DIO writes. If you wonder why, take a
look at ext4_unaligned_aio() and all the ext4 AIO DIO locking
mechanisms... And that's the reason I bailed out on non-block-aligned
AIO in the previous DIO alignment patches. I think I should just keep
the AIO DIO bailout logic, since adding a locking method slows down
writers that could go locklessly through the MDS. I will revive the
bailout patches after fixing the buffered IO things.
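
(Something along the lines of the untested check below, in the spirit
of ext4_unaligned_aio(); the helper name is invented:)

/* Untested sketch: an AIO DIO write that is not BLOCK aligned at both
 * ends gets bounced back through the MDS instead of the layout driver. */
static bool bl_dio_block_unaligned(loff_t pos, size_t count,
                                   unsigned int blocksize)
{
        return (pos & (blocksize - 1)) || (count & (blocksize - 1));
}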

Cheers,
Tao

2012-07-26 15:44:45

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 06:30 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 10:30 PM, Boaz Harrosh <[email protected]> wrote:
>> On 07/26/2012 04:57 PM, Peng Tao wrote:
>>
>>> On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <[email protected]> wrote:
>>>> On 07/26/2012 11:25 AM, Peng Tao wrote:
>>>>
>>>>> For these two sectors, I need to allocate two pages... Just look at
>>>>> struct bio_vec.
>>>>>
>>>>
>>>>
>>>> NO! I know all about bio_vecs
>>>>
>>>> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
>>>> did you get the "two pages" from?
>>>>
>>> What do you put int bio_vec->bv_page? Even if you just use 512 bytes
>>> of a page, it is still allocated page.
>>>
>>
>>
>> No!!
>>
>> You just use bio_map_kern or in one go blk_rq_map_kern() with any: kmalloc,
>> stack, or kernel pointer. And that's that. It will take what it will take.
>>
> First I should admit I don't know bio_map_kern() alike. Thanks for
> teaching me about them (see, here is your credit :)
>
> Looking at them, I don't think it is proper to use them in block
> layout code, mainly because It is a layer violation. Searching for
> callers of io_map_kern/blk_rq_map_kern(), they are either block core
> code or device drivers. Clearly these interfaces are block layer
> internal APIs and file systems shouldn't touch them.
>
> Therefore I still think I should stick with plain common
> bio_alloc/bio_add_page/submit_bio interfaces. Thanks for the
> suggestion.
>


Crap, no!!!

bio_kmalloc/bio_map_kern/submit_bio is just fine. Same exact layer,
same exact BIO API level.

You can mix and match bio_map_kern chained with any bio_add_page,
and submit one chain or let the elevator chain them with plug/unplug;
it is all the same.
If you have pages - bio_add_page;
Have pointers - bio_map_kern.
They are used all over libosd, exofs, the ore, scsi_lib, and whatnot.
Any user of the block layer, which is what blocklayout is!
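
The point being that a bio_vec does not need a dedicated page of its
own: it can reference 512 bytes inside any kmalloc'ed buffer, which is
roughly what bio_map_kern() does for you internally. An untested
sketch of that, using only the plain bio interfaces:

/* Untested sketch: point a 1-vec BIO at a kmalloc'ed 512-byte side
 * buffer instead of a whole dedicated page.  sect must lie within a
 * single page for this to be valid. */
void *sect = kmalloc(512, GFP_NOFS);
struct bio *bio = bio_kmalloc(GFP_NOFS, 1);

if (sect && bio)
        bio_add_page(bio, virt_to_page(sect), 512, offset_in_page(sect));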

Again some Credit

Boaz

>
> Cheers,
> Tao



2012-07-26 02:44:14

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Thu, Jul 26, 2012 at 4:29 AM, Boaz Harrosh <[email protected]> wrote:
> On 07/25/2012 05:43 PM, Peng Tao wrote:
>
>> On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <[email protected]> wrote:
>>> On 07/25/2012 10:31 AM, Peng Tao wrote:
>>>
>>>> Hi Boaz,
>>>>
>>>> Sorry about the long delay. I had some internal interrupt. Now I'm
>>>> looking at the partial LD write problem again. Instead of trying to
>>>> bail out unaligned writes blindly, this time I want to fix the write
>>>> code to handle partial write as you suggested before. However, it
>>>> seems to be more problematic than I used to think.
>>>>
>>>> The dirty range of a page passed to LD->write_pagelist may be
>>>> unaligned to sector size, in which case block layer cannot handle it
>>>> correctly. Even worse, I cannot do a read-modify-write cycle within
>>>> the same page because bio would read in the entire sector and thus
>>>> ruin user data within the same sector. Currently I'm thinking of
>>>> creating shadow pages for partial sector write and use them to read in
>>>> the sector and copy necessary data into user pages. But it is way too
>>>> tricky and I don't feel like it at all. So I want to ask how you solve
>>>> the partial sector write problem in object layout driver.
>>>>
>>>> I looked at the ore code and found that you are using bio to deal with
>>>> partial page read/write as well. But in places like _add_to_r4w(), I
>>>> don't see how partial sectors are handled. Maybe I was misreading the
>>>> code. Would you please shed some light? More specifically, how does
>>>> object layout driver handle partial sector writers like in bellow
>>>> simple testcase? Thanks in advance.
>>>>
>>>
>>>
>>> The objlayout does not have this problem. OSD-SCSI is a byte aligned
>>> protocol, unlike DISK-SCSI.
>>>
>> aha, I see. So this is blocklayout only problem.
>>
>>> The code you are looking for is at _add_to_r4w_first_page() &&
>>> _add_to_r4w_last_page. But as I said I just submit a read of:
>>> 0 => offset within the page
>>> What ever that might be.
>>>
>>> In your case: why? all you have to do is allocate 2 sectors (1k) at
>>> most one for partial sector at end and one for partial sector at
>>> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>>>
>>> What you do is chain a single-sector BIO to an all aligned BIO
>>>
>> Yeah, it is exactly what I mean by "shadow pages" except for the
>> chained BIO part. I said "shadow pages" because I need to create one
>> or two pages to construct bio_vec to do the full sector sync read, and
>> the pages cannot be attached to inode address space (that's why
>> "shadow" :-).
>>
>> I asked because I don't like the solution and thought maybe there is
>> better method in object layout and I didn't find it in object code.
>> Now that it is a blocklayout only problem, I guess I'll have to do the
>> full sector sync reads tricks.
>>
>>> You do the following:
>>>
>>> - You will need to preform two reads, right? One for the unaligned
>>> BLOCK at the begging and one for the BLOCK at the end. Since in
>>> blocklayout all IO is BLOCK aligned.
>>>
>>> Beginning end of IO
>>> - Jump over first unaligned SECTOR. Prepare BIO from first full
>>> sector, to the end of the BLOCK.
>>> - Prepare a 1-biovec BIO from the above allocated sector, which
>>> reads the full first sector.
>>> - perpend the 1-vec BIO to the big one.
>>> - preform the read
>>> - memcpy from above allocated sector the 0=>offset part into the
>>> NFS original page.
>>>
>>> Do the same for end of IO but for the very last unaligned sector.
>>> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
>>> part.
>>>
>>> So you see no shadow pages and not so complicated. In the unaligned
>>> case at most you need allocate 1k and chain BIOs at beginning and/or
>>> at end.
>>>
>>> Tell me if you need help with BIO chaining. The 1-vec BIO just use
>>> bio_kmalloc().
>>>
>> yeah, I do have a question on the BIO chaining thing. IMO, I need to
>> do one or two sync full sector reads, and memcpy the data in the pages
>> to fill original NFS page into sector aligned. And then I can issue
>> the sector aligned writes to write out all nfs pages. So I don't quite
>> get it when you say "perpend the 1-vec BIO to the big one", because
>> the sector aligned writes (the big one) must be submitted _after_ the
>> full sector sync reads and memcpy. Would you explain it a bit?
>>
>
>
> I'm not sure if that is what you meant but I thought you need to write
> as part of the original IO also the reminder of the last and fist BLOCKs
>
> BLOCK means the unit set by the MDS as the atomic IO operation of any
> IO. If not a full BLOCK is written then the read-layout needs to be used
> to copy the un written parts of the BLOCK into the write layout.
>
Not sure about objectlayout, but for block layout, we really don't
have to always read/write in BLOCK size. A BLOCK is just the minimal
trackable extent, and it is the extent state that we care about.
If it is a read-write extent (which is the common case for rewrite),
the blocklayout client can do whatever size of IO the underlying
hardware supports (in the DISK-SCSI case, SECTOR size).

> And that BLOCK can be bigger then a page (multiple of pages) and therefore
> also bigger then a sector (512 bytes).
>
> [In objects layout RFC the stripe_unit is not mandatory a multiple of
> PAGE_SIZE, but if it is not so, we return error at alloc_lseg and use
> MDS. I hope it is the same for blocklayout. BLOCK if bigger then
> PAGE_SIZE should be multiple of. If not revert to MDS IO]
>
> So this is what I see. Say BLOCK is two pages.
>
> The complete IO will look like:
>
> .....| block 0 || block 1 || block 2 || block 3 |......
> .....|page 0|page 1||page 2|page 3||page 4|page 5||page 6|page 7|......
> ^ ^ ^ ^
> | |<--------------------------------->| |
> | NFS-IO-start NFS-IO-end |
> | | | |
> | | | |
> |<-read I->| |<-read II->|
> |<-------------------------------------------------------->|
> Written-IO-start Written-IO-end
>
> Note that the best is if all these pages above, before the write
> operation, are at page-cache if not it is asking for trouble.
>
> lets zoom into the first block. (The same at last block but opposite)
>
> .....| block 0 |......
> .....| page 0 | page 1 |......
> .....| sec0 | sec1 | sec2 | sec3 | sec4 | sec5 | sec6 | sec7 |......
> ^ ^
> | |----------......
> | NFS-IO-start
> |<----------------read I--------------------->|
> |<----------------BIO_A------------------>| |
> |<->| <---- memcpy-part
> BIO_B---> |<--->|
>
> (Sorry I put 4 sectors per page, it is 8, but the principle is the same)
Thanks a lot for the graph, it is very impressive and helps me a lot
in understanding your idea.

>
> You can not submit an IO read as one BIO into the original cache pages
> because sec6 above will be needed to be read complete and this will
> over-write the good part of sec6 which has valid data.
>
> So you make one BIO_A with sec0-5 which point to original page-cache pages.
> You make a second BIO_B which points to a side buffer of a the full sec6, and
> you chain them. ie:
> BIO_A->bi_next = BIO_B (This is what I mean post-pend)
>
As I explained above, the block layout client doesn't have to read
sec0-5 if the extent is already read-write. Only when the extent is
invalid and there is a copy-on-write extent does the client need to
read in data from the cow extent. And the BIO chaining thing is really
unnecessary IMHO. In the cases where the client needs to read in from
the cow extent, I can just use a single BIO to read in sec0-6 and
memcpy sec4-5 and part of sec6 into the original nfs page.

It's not complicated. I have already cooked the patch and will send it
out later today after more testing. It's just that I don't like the
solution, because I'll have to allocate more pages to construct the
bio_vec to do the read. It is an extra effort, especially in the
memory-reclaim writeback case. Maybe I should make sure single page
writeback doesn't go through the block layout LD.

> - Now submit the one read, two BIOs chained.
> - Do the same for the NFS-IO-end above, also one read 2 BIOs chained
>
> - Wait for both reads to return
>
> - Then you memcpy sec6 0 to offset%sector_size into original cache page.
> - Same for the end part, last_byte to sector_end
>
> - Now you can submit the full write.
>
> Both page 0 and page 1 can be marked as uptodate. But most important
> page 0 was not in cache before the preparation of the read, it must
> be marked as PageUptodate().
>
Another thing is, this further complicates direct writes, where I
cannot use the pagecache to ensure proper locking for concurrent
writers in the same BLOCK, and sector-aligned partial BLOCK DIO writes
need to be serialized internally. IOW, the same code cannot be reused
by DIO writes. Sigh...

--
Thanks,
Tao

2012-07-25 20:30:22

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/25/2012 05:43 PM, Peng Tao wrote:

> On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <[email protected]> wrote:
>> On 07/25/2012 10:31 AM, Peng Tao wrote:
>>
>>> Hi Boaz,
>>>
>>> Sorry about the long delay. I had some internal interrupt. Now I'm
>>> looking at the partial LD write problem again. Instead of trying to
>>> bail out unaligned writes blindly, this time I want to fix the write
>>> code to handle partial write as you suggested before. However, it
>>> seems to be more problematic than I used to think.
>>>
>>> The dirty range of a page passed to LD->write_pagelist may be
>>> unaligned to sector size, in which case block layer cannot handle it
>>> correctly. Even worse, I cannot do a read-modify-write cycle within
>>> the same page because bio would read in the entire sector and thus
>>> ruin user data within the same sector. Currently I'm thinking of
>>> creating shadow pages for partial sector write and use them to read in
>>> the sector and copy necessary data into user pages. But it is way too
>>> tricky and I don't feel like it at all. So I want to ask how you solve
>>> the partial sector write problem in object layout driver.
>>>
>>> I looked at the ore code and found that you are using bio to deal with
>>> partial page read/write as well. But in places like _add_to_r4w(), I
>>> don't see how partial sectors are handled. Maybe I was misreading the
>>> code. Would you please shed some light? More specifically, how does
>>> object layout driver handle partial sector writers like in bellow
>>> simple testcase? Thanks in advance.
>>>
>>
>>
>> The objlayout does not have this problem. OSD-SCSI is a byte aligned
>> protocol, unlike DISK-SCSI.
>>
> aha, I see. So this is blocklayout only problem.
>
>> The code you are looking for is at _add_to_r4w_first_page() &&
>> _add_to_r4w_last_page. But as I said I just submit a read of:
>> 0 => offset within the page
>> What ever that might be.
>>
>> In your case: why? all you have to do is allocate 2 sectors (1k) at
>> most one for partial sector at end and one for partial sector at
>> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>>
>> What you do is chain a single-sector BIO to an all aligned BIO
>>
> Yeah, it is exactly what I mean by "shadow pages" except for the
> chained BIO part. I said "shadow pages" because I need to create one
> or two pages to construct bio_vec to do the full sector sync read, and
> the pages cannot be attached to inode address space (that's why
> "shadow" :-).
>
> I asked because I don't like the solution and thought maybe there is
> better method in object layout and I didn't find it in object code.
> Now that it is a blocklayout only problem, I guess I'll have to do the
> full sector sync reads tricks.
>
>> You do the following:
>>
>> - You will need to preform two reads, right? One for the unaligned
>> BLOCK at the begging and one for the BLOCK at the end. Since in
>> blocklayout all IO is BLOCK aligned.
>>
>> Beginning end of IO
>> - Jump over first unaligned SECTOR. Prepare BIO from first full
>> sector, to the end of the BLOCK.
>> - Prepare a 1-biovec BIO from the above allocated sector, which
>> reads the full first sector.
>> - perpend the 1-vec BIO to the big one.
>> - preform the read
>> - memcpy from above allocated sector the 0=>offset part into the
>> NFS original page.
>>
>> Do the same for end of IO but for the very last unaligned sector.
>> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
>> part.
>>
>> So you see no shadow pages and not so complicated. In the unaligned
>> case at most you need allocate 1k and chain BIOs at beginning and/or
>> at end.
>>
>> Tell me if you need help with BIO chaining. The 1-vec BIO just use
>> bio_kmalloc().
>>
> yeah, I do have a question on the BIO chaining thing. IMO, I need to
> do one or two sync full sector reads, and memcpy the data in the pages
> to fill original NFS page into sector aligned. And then I can issue
> the sector aligned writes to write out all nfs pages. So I don't quite
> get it when you say "perpend the 1-vec BIO to the big one", because
> the sector aligned writes (the big one) must be submitted _after_ the
> full sector sync reads and memcpy. Would you explain it a bit?
>


I'm not sure if that is what you meant, but I thought you need to
write, as part of the original IO, also the remainder of the last and
first BLOCKs.

BLOCK means the unit set by the MDS as the atomic IO operation of any
IO. If not a full BLOCK is written, then the read layout needs to be
used to copy the unwritten parts of the BLOCK into the write layout.

And that BLOCK can be bigger than a page (a multiple of pages) and
therefore also bigger than a sector (512 bytes).

[In the objects layout RFC the stripe_unit is not required to be a
multiple of PAGE_SIZE, but if it is not, we return an error at
alloc_lseg and use the MDS. I hope it is the same for blocklayout: a
BLOCK, if bigger than PAGE_SIZE, should be a multiple of it. If not,
revert to MDS IO.]

So this is what I see. Say BLOCK is two pages.

The complete IO will look like:

.....|   block 0   ||   block 1   ||   block 2   ||   block 3   |......
.....|page 0|page 1||page 2|page 3||page 4|page 5||page 6|page 7|......
     ^           ^                                  ^           ^
     |           |<-------------------------------->|           |
     |           NFS-IO-start              NFS-IO-end           |
     |           |                                  |           |
     |           |                                  |           |
     |<--read I->|                                  |<-read II->|
     |<-------------------------------------------------------->|
     Written-IO-start                              Written-IO-end

Note that it is best if all these pages above are in the page-cache
before the write operation; if not, it is asking for trouble.

Let's zoom into the first block. (The same happens at the last block,
but in the opposite direction.)

.....|                        block 0                        |......
.....|          page 0           |          page 1           |......
.....| sec0 | sec1 | sec2 | sec3 | sec4 | sec5 | sec6 | sec7 |......
                                               ^   ^
                                               |   |----------......
                                               |   NFS-IO-start
     |<--------------------read I-------------------->|
     |<-----------------BIO_A----------------->|   |
                                               |<->|   <---- memcpy-part
                                     BIO_B --->|<---->|

(Sorry I put 4 sectors per page, it is 8, but the principle is the same)

You cannot submit the read as one BIO into the original cache pages,
because sec6 above would need to be read in full, and that would
overwrite the good part of sec6 which already holds valid data.

So you make one BIO_A with sec0-5, which points at the original
page-cache pages. You make a second BIO_B, which points at a side
buffer holding the full sec6, and you chain them, i.e.:
BIO_A->bi_next = BIO_B (this is what I mean by post-pend)

- Now submit the one read, two BIOs chained.
- Do the same for the NFS-IO-end above, also one read of 2 chained BIOs.

- Wait for both reads to return.

- Then you memcpy the 0 => offset%sector_size part of sec6 into the
original cache page.
- Same for the end part, last_byte => sector_end.

- Now you can submit the full write.

Both page 0 and page 1 can be marked as uptodate. But most important:
page 0 was not in cache before the read was prepared, so it must be
marked as PageUptodate().

> Thanks,
> Tao


Believe me, the code is not much bigger than the explanation above, I
hope.
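
(For what it's worth, the copy step could look roughly like the
untested sketch below; the helper name and arguments are invented.)

/* Untested sketch: after the chained read completes, copy the
 * 0 => offset%sector_size head of the side buffer (sec6) into the
 * original page-cache page, then submit the write as usual. */
static void bl_fixup_first_sector(struct page *page, void *sect_buf,
                                  unsigned int pg_off, unsigned int bytes)
{
        void *kaddr = kmap_atomic(page);

        memcpy(kaddr + pg_off, sect_buf, bytes); /* bytes = offset % 512 */
        kunmap_atomic(kaddr);
        SetPageUptodate(page);
}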

Cheers
Boaz

2012-07-26 07:48:13

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 05:43 AM, Peng Tao wrote:

> Another thing is, this further complicates direct writes, where I
> cannot use pagecache to ensure proper locking for concurrent writers
> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
> be serialized internally. IOW, the same code cannot be reused by DIO
> writes. sigh...
>


One last thing. Applications that use direct IO know to allocate
and issue sector-aligned requests, both at offset and length.
That's a kernel requirement. It is not for NFS, but even so.

Just refuse sector-unaligned DIO and revert to MDS.

With sector-aligned IO you DIO directly into the DIO pages;
problem solved.

If you need the COW of partial blocks, you still use
page-cache pages, which is fine because they do not
intersect any of the DIO.

Cheers
Boaz

2012-07-25 14:44:23

by Peng Tao

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On Wed, Jul 25, 2012 at 6:28 PM, Boaz Harrosh <[email protected]> wrote:
> On 07/25/2012 10:31 AM, Peng Tao wrote:
>
>> Hi Boaz,
>>
>> Sorry about the long delay. I had some internal interrupt. Now I'm
>> looking at the partial LD write problem again. Instead of trying to
>> bail out unaligned writes blindly, this time I want to fix the write
>> code to handle partial write as you suggested before. However, it
>> seems to be more problematic than I used to think.
>>
>> The dirty range of a page passed to LD->write_pagelist may be
>> unaligned to sector size, in which case block layer cannot handle it
>> correctly. Even worse, I cannot do a read-modify-write cycle within
>> the same page because bio would read in the entire sector and thus
>> ruin user data within the same sector. Currently I'm thinking of
>> creating shadow pages for partial sector write and use them to read in
>> the sector and copy necessary data into user pages. But it is way too
>> tricky and I don't feel like it at all. So I want to ask how you solve
>> the partial sector write problem in object layout driver.
>>
>> I looked at the ore code and found that you are using bio to deal with
>> partial page read/write as well. But in places like _add_to_r4w(), I
>> don't see how partial sectors are handled. Maybe I was misreading the
>> code. Would you please shed some light? More specifically, how does
>> object layout driver handle partial sector writers like in bellow
>> simple testcase? Thanks in advance.
>>
>
>
> The objlayout does not have this problem. OSD-SCSI is a byte aligned
> protocol, unlike DISK-SCSI.
>
Aha, I see. So this is a blocklayout-only problem.

> The code you are looking for is at _add_to_r4w_first_page() &&
> _add_to_r4w_last_page. But as I said I just submit a read of:
> 0 => offset within the page
> What ever that might be.
>
> In your case: why? all you have to do is allocate 2 sectors (1k) at
> most one for partial sector at end and one for partial sector at
> beginning. And use chained BIOs then memcpy at most [1k -2] bytes.
>
> What you do is chain a single-sector BIO to an all aligned BIO
>
Yeah, it is exactly what I mean by "shadow pages", except for the
chained BIO part. I said "shadow pages" because I need to create one
or two pages to construct a bio_vec for the full-sector sync read, and
those pages cannot be attached to the inode address space (that's why
"shadow" :-).

I asked because I don't like the solution and thought maybe there was a
better method in the object layout that I had missed in the object code.
Now that it is clearly a blocklayout-only problem, I guess I'll have to
do the full-sector sync read tricks.

> You do the following:
>
> - You will need to preform two reads, right? One for the unaligned
> BLOCK at the begging and one for the BLOCK at the end. Since in
> blocklayout all IO is BLOCK aligned.
>
> Beginning end of IO
> - Jump over first unaligned SECTOR. Prepare BIO from first full
> sector, to the end of the BLOCK.
> - Prepare a 1-biovec BIO from the above allocated sector, which
> reads the full first sector.
> - perpend the 1-vec BIO to the big one.
> - preform the read
> - memcpy from above allocated sector the 0=>offset part into the
> NFS original page.
>
> Do the same for end of IO but for the very last unaligned sector.
> Chain 1-vec BIO to the end this time. memcpy last_byte=>end-of-sector
> part.
>
> So you see no shadow pages and not so complicated. In the unaligned
> case at most you need allocate 1k and chain BIOs at beginning and/or
> at end.
>
> Tell me if you need help with BIO chaining. The 1-vec BIO just use
> bio_kmalloc().
>
Yeah, I do have a question on the BIO chaining thing. As I see it, I need
to do one or two sync full-sector reads, and memcpy the data from those
pages so the original NFS pages become sector aligned. Only then can I
issue the sector-aligned writes to write out all the NFS pages. So I don't
quite get it when you say "perpend the 1-vec BIO to the big one", because
the sector-aligned writes (the big one) must be submitted _after_ the
full-sector sync reads and memcpy. Would you explain it a bit?

Thanks,
Tao

2012-07-26 14:13:06

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 12:12 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 3:47 PM, Boaz Harrosh <[email protected]> wrote:
>> On 07/26/2012 05:43 AM, Peng Tao wrote:
>>
>>> Another thing is, this further complicates direct writes, where I
>>> cannot use pagecache to ensure proper locking for concurrent writers
>>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>>> be serialized internally. IOW, the same code cannot be reused by DIO
>>> writes. sigh...
>>>
>>
>>
>> One last thing. Applications who use direct IO know to allocate
>> and issue sector aligned requests both at offset and length.
>> That's a Kernel requirement. It is not for NFS, but even so.
>>
>> Just refuse sector unaligned DIO and revert to MDS.
>>
>> With sector aligned IO you directly DIO to DIO pages,
>> problem solved.
>>
>> If you need the COW of partial blocks, you still use
>> page-cache pages, which is fine because they do not
>> intersect any of the DIO.
>>
> I certainly thought about it, but it doesn't work for AIO DIO case.
> Assuming BLOCK size is 8K, process A write to 0~4095 bytes of file foo
> with AIO DIO, at the same time process B write to 4096~8191 with AIO
> DIO at the same time. If kernel ever tries to reply on page cache to
> cope with invalid extent, it ends up with data corruption.
>
> This is a common problem for any extent based file system to deal with
> partial BLOCK (_NOT SECTOR_) AIODIO writes. If you wonder why, take a
> look at ext4_unaligned_aio() and all the ext4 AIODIO locking
> mechanisms... And that's the reason I bailed out non-block aligned AIO
> in previous DIO alignment patches. I think I should just keep the
> AIODIO bailout logic since adding locking method is slowing down
> writers while they can go locklessly through MDS. I will revive the
> bailout patches after fixing the buffer IO things.
>


There is an easy locking solution for DIO which will not cost much
for DIO and will cost nothing for buffered IO: you use the page-cache
page lock.

What you do is grab the lock of each block's zero-page before/during writing to
any block. So in your example above the two writers will both be serialized on
the page-zero lock.

Of course, as before, you need to flush the page-cache pages before the DIO and
invalidate all of them (mark them not up to date). You keep at least one page
per block in the page-cache, but during DIO it will always be in a
not-up-to-date, empty state.

Then, if needed, as in the example above, the first-time COW you still do
through the page-cache.
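
A sketch of that page-zero lock (the name and the index calculation are
made up; find_or_create_page() returns the page already locked):

#include <linux/pagemap.h>

/* Serialize all writers of one BLOCK on the page lock of that block's
 * first page-cache page.  block_first_index would be the page index of
 * the block's first byte, e.g. block_byte_offset >> PAGE_CACHE_SHIFT. */
static struct page *bl_lock_block_zero_page(struct address_space *mapping,
					    pgoff_t block_first_index)
{
	/* also keeps at least this one page per block in the page-cache */
	struct page *zp = find_or_create_page(mapping, block_first_index,
					      GFP_NOFS);

	/* hold the page lock for the duration of the DIO to this block,
	 * then unlock_page(zp) and page_cache_release(zp) */
	return zp;
}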

*
* That said, I think your solution of only allowing BLOCK-aligned DIO is good.
* Applications should learn. They should, however, find out what the BLOCK size is.
*

You could keep the proper info at the DM device you create for each device_id.
See here: http://people.redhat.com/msnitzer/docs/io-limits.txt
The "logical_block_size" there should be the proper BLOCK size above.

But we will need to talk about how to associate files with device_ids,
perhaps via /proc or some IOCTL. There have also been talks among the
Lustre/cluster-filesystem people about defining a FILE_LAYOUT xattr, which
would return this layout information in a generic way, for the likes of tar,
backup, and so on.

> Cheers,
> Tao


Cheers
Boaz

2012-07-26 14:30:42

by Boaz Harrosh

[permalink] [raw]
Subject: Re: pnfs LD partial sector write

On 07/26/2012 04:57 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 8:16 PM, Boaz Harrosh <[email protected]> wrote:
>> On 07/26/2012 11:25 AM, Peng Tao wrote:
>>
>>> For these two sectors, I need to allocate two pages... Just look at
>>> struct bio_vec.
>>>
>>
>>
>> NO! I know all about bio_vecs
>>
>> You need 1024 bytes, and 2 x one entry BIOs which is a few bytes, where
>> did you get the "two pages" from?
>>
> What do you put int bio_vec->bv_page? Even if you just use 512 bytes
> of a page, it is still allocated page.
>


No!!

You just use bio_map_kern(), or in one go blk_rq_map_kern(), with any kernel
pointer: kmalloc'ed, stack, whatever. And that's that. It will take what it will take.

Two such BIOs can use different regions of the same page, or a small region
sharing a page with other kmalloc allocations.

I don't see where you got that idea from.

And for the bio itself you use bio_kmalloc(GFP_KERNEL/GFP_NOIO, numentries),
which will give you the BIO plus its bio_vecs in one allocation.
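
For example, something along these lines (a sketch only; bl_map_side_sector
is a made-up name and q/bdev are assumed to come from the layout segment):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/err.h>

/* Map a kmalloc'ed 512-byte buffer as a 1-vec BIO with bio_map_kern();
 * no dedicated page is needed for the side sector. */
static struct bio *bl_map_side_sector(struct request_queue *q,
				      struct block_device *bdev,
				      sector_t isect, void *sec_buf)
{
	struct bio *bio = bio_map_kern(q, sec_buf, 512, GFP_NOIO);

	if (IS_ERR(bio))
		return NULL;
	bio->bi_sector = isect;
	bio->bi_bdev   = bdev;
	return bio;
}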

And that is that

You should give me more credit
Boaz