Issuing layoutget at .pg_init will drop the IO size information and ask for a 4KB
layout every time. However, the IO size information is very valuable for the MDS to
determine how much layout it should return to the client.
The patchset tries to allow the LD not to send layoutget at .pg_init but instead at
pnfs_do_multiple_writes, so that the real IO size is preserved and sent to the MDS.
Tests against a server that does not aggressively pre-allocate layout show
that the IO size information is really useful to the block layout MDS.
The generic pnfs layer changes are trivial for the file and object layouts as long as
they still send layoutget at .pg_init.
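To illustrate the idea, here is a rough sketch only (not the literal patch): it assumes
the current pnfs_update_layout() prototype, uses the variables of the surrounding loop
in pnfs_do_multiple_writes(), and elides error handling and lseg release.

	/* Sketch: get the lseg in pnfs_do_multiple_writes() instead of .pg_init,
	 * using the size of the coalesced write as the requested layout range. */
	if (!lseg) {
		struct nfs_page *req = nfs_list_entry(data->pages.next);
		__u64 length = data->npages << PAGE_CACHE_SHIFT;

		/* ask the MDS for a layout covering the whole write */
		lseg = pnfs_update_layout(desc->pg_inode, req->wb_context,
					  req_offset(req), length,
					  IOMODE_RW, GFP_NOFS);
		if (!lseg) {
			/* no layout: fall back to writing through the MDS */
			pnfs_write_through_mds(desc, data);
			continue;
		}
	}

With that in place the LAYOUTGET range reflects what the client actually intends to
write, instead of a single page.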
iozone cmd:
./iozone -r 1m -s 4G -w -W -c -t 10 -i 0 -F /mnt/iozone.data.1 /mnt/iozone.data.2 /mnt/iozone.data.3 /mnt/iozone.data.4 /mnt/iozone.data.5 /mnt/iozone.data.6 /mnt/iozone.data.7 /mnt/iozone.data.8 /mnt/iozone.data.9 /mnt/iozone.data.10
Before patch: around 12MB/s throughput
After patch: around 72MB/s throughput
Peng Tao (4):
nfsv41: export pnfs_find_alloc_layout
nfsv41: add and export pnfs_find_get_layout_locked
nfsv41: get lseg before issue LD IO if pgio doesn't carry lseg
pnfsblock: do ask for layout in pg_init
fs/nfs/blocklayout/blocklayout.c |   54 ++++++++++++++++++++++++++-
fs/nfs/pnfs.c                    |   74 +++++++++++++++++++++++++++++++++++++-
fs/nfs/pnfs.h                    |    9 +++++
3 files changed, 134 insertions(+), 3 deletions(-)
On Tue, Nov 29, 2011 at 04:50:23PM -0800, Boaz Harrosh wrote:
> On 11/29/2011 04:37 PM, Trond Myklebust wrote:
> > On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
> >> You ignored my main point, I was talking about the server side, my point
> >> was that there is nothing to build on on the serve side since the pNFS
> >> Linux server is not happening.
> >> Marc.
> >
> > Sorry. I misunderstood your concern. As far as I know, the main problem
> > there is also one of investment: nobody has stepped up to help Bruce
> > write a pNFS server.
> >
>
> I use "OUR" Linux pNFS Server every day, all day. It's there it is written
> and it kicks us. We've been working on that for 5 Years now.
>
> The only holding factor is that no one wants to do the final step and
> fight the VFS guys to actually push it into Linus.
As I've said before, my only requirement is that we hold off on merging
optional 4.1 features (like pNFS) until the *mandatory* parts of 4.1
comply with the spec. Once that happens, I'm happy to help with merging
pNFS however I can, including working out any VFS problems.
Most of us have been assuming that the GFS2-backed server would be the
simplest to merge first, but with the client supporting all three layout
types, maybe it would be reasonable to start with the exofs-backed
server if that's in better shape.
To help the process along I'm continuing to maintain a 4.1 todo list:
http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues#Highest_priority
I believe that could all be done relatively soon with some more help.
--b.
> But it is there and alive in full open-source glory in Benny's tree.
>
> > I'm less worried about this now than I was earlier, because other open
> > source efforts are gaining traction (see Ganesha - which is being
> > sponsored by IBM, and projects such as Tigran's java based pNFS server).
> > The other point is that we've developed client test-rigs that don't
> > depend on the availability of a Linux server (newpynfs and the pynfs
> > based proxy).
> >
>
> For me I'm testing with the linux Server and it is just there and works
> well for me. Better than any other solution I have right now.
>
> > Cheers
> > Trond
> >
>
> Ciao
> Heart
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/29/2011 02:47 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
>> On 11/29/2011 01:57 PM, Trond Myklebust wrote:
>>>> Also Files when they will support segments and servers that request segments,
>>>> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
>>>> you know you want to write.
>>>
>>> Why would we want to add segment support to the pNFS files client???
>>> Segments are a nuisance that multiply the amount of unnecessary chitchat
>>> between the client and the MDS without providing any tangible
>>> benefits...
>>>
>>
>> Your kidding right?
>>
>> One: it is mandated by the Standard, This is not an option. So a perfectly
>> Standard complaint server is not Supported by Linux because we don't see
>> the point.
>
> Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
> give us a full layout, then we fall back to write through MDS. Why dick
> around with crap that SLOWS YOU DOWN.
>
NO! It MAKES YOU FASTER.
For the kind of topologies I'm talking about, a single layoutget every 1GB is
marginal compared to the gain I get in deploying 100s of DSs. I have thousands of
DSs and I want to spread the load evenly. I'm limited by the size of the layout
(device info in the case of files), so I'm limited by the number of DSs I can
have in a layout. For large files these few devices become a hot spot while
the rest of the cluster sits idle.
This is not a theory; we meet these problems every day.
>> Two: There are already file-layout servers out there (multiple) which are
>> waiting for the Linux files-layout segment support, because the underline
>> FS requires Segments and now they do not work with the Linux client. These
>> are CEPH and GPFS and more.
>
> Then they will have a _long_ wait....
>
OK, so now I understand. When I was talking to Fred before the BAT and during it,
it was very, very peculiar to me why he was not already done with that simple stuff,
because usually Fred is such a brilliant, fast programmer that I admire. And that simple
crap?
But now that explains it.
> Trond
>
Heart
The only 1 cent that I can add to this argument is that I was led to
believe by you and others that the Linux kernel doesn't add functionality on the
client side that is not supported on the server side. The last time I made
this point I was told that it is OK if they are off by a version or two.
You made it clear that this is no longer true and that the Linux client
and server are now independent of each other. I spent time working on the
server side believing that the client and server would progress more or
less together; I am disappointed to find out that that is not the case any more.
Marc.
From: Trond Myklebust <[email protected]>
To: Boaz Harrosh <[email protected]>
Cc: Peng Tao <[email protected]>, [email protected], [email protected], Garth Gibson <[email protected]>, Matt Benjamin <[email protected]>, Marc Eshel <[email protected]>, Fred Isaman <[email protected]>
Date: 11/29/2011 03:30 PM
Subject: Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
> On 11/29/2011 02:47 PM, Trond Myklebust wrote:
> > On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> >> On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> >>>> Also Files when they will support segments and servers that request segments,
> >>>> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> >>>> you know you want to write.
> >>>
> >>> Why would we want to add segment support to the pNFS files client???
> >>> Segments are a nuisance that multiply the amount of unnecessary chitchat
> >>> between the client and the MDS without providing any tangible
> >>> benefits...
> >>>
> >>
> >> Your kidding right?
> >>
> >> One: it is mandated by the Standard, This is not an option. So a perfectly
> >> Standard complaint server is not Supported by Linux because we don't see
> >> the point.
> >
> > Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
> > give us a full layout, then we fall back to write through MDS. Why dick
> > around with crap that SLOWS YOU DOWN.
> >
>
> NO! MAKE YOU FASTER.
>
> The kind of typologies I'm talking about a single layout get ever 1GB is
> marginal to the gain I get in deploying 100 of DSs. I have thousands of
> DSs I want to spread the load evenly. I'm limited by the size of the layout
> (Device info in the case of files) So I'm limited by the number of DSs I can
> have in a layout. For large files these few devices become an hot spot all
> the while the rest of the cluster is idle.
I call "bullshit" on that whole argument...
You've done sod all so far to address the problem of a client managing
layout segments for a '1000 DS' case. Are you expecting that all pNFS
object servers out there are going to do that for you? How do I assume
that a generic pNFS files server is going to do the same? As far as I
know, the spec is completely moot on the whole subject.
IOW: I'm not even remotely interested in your "everyday problems" if
there are no "everyday solutions" that actually fit the generic can of
spec worms that the pNFS layout segments open.
> >> Two: There are already file-layout servers out there (multiple) which are
> >> waiting for the Linux files-layout segment support, because the underline
> >> FS requires Segments and now they do not work with the Linux client. These
> >> are CEPH and GPFS and more.
> >
> > Then they will have a _long_ wait....
> >
>
> OK, so now I understand. Because when I was talking to Fred before BAT and during
> It was very very peculiar to me why he is not already done with that simple stuff.
> Because usually Fred is such a brilliant fast programmer that I admire, and that simple
> crap?
>
> But now that explains
Yes. It's all a big conspiracy, and we're deliberately holding Fred's
genius back in order to disappoint you...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 2011-11-30 02:58, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 16:24 -0800, Boaz Harrosh wrote:
>> On 11/29/2011 03:30 PM, Trond Myklebust wrote:
>>> On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
>>>>
>>>> The kind of typologies I'm talking about a single layout get ever 1GB is
>>>> marginal to the gain I get in deploying 100 of DSs. I have thousands of
>>>> DSs I want to spread the load evenly. I'm limited by the size of the layout
>>>> (Device info in the case of files) So I'm limited by the number of DSs I can
>>>> have in a layout. For large files these few devices become an hot spot all
>>>> the while the rest of the cluster is idle.
>>>
>>> I call "bullshit" on that whole argument...
>>>
>>> You've done sod all so far to address the problem of a client managing
>>
>> sod? I don't know this word?
>
> 'sod all' == 'nothing'
>
> it's an English slang...
>
>>> layout segments for a '1000 DS' case. Are you expecting that all pNFS
>>> object servers out there are going to do that for you? How do I assume
>>> that a generic pNFS files server is going to do the same? As far as I
>>> know, the spec is completely moot on the whole subject.
>>>
>>
>> What? The all segments thing is in the Generic part of the spec and is not
>> at all specific or even specified in the objects and blocks RFCs.
>
> ..and it doesn't say _anything_ about how a client is supposed to manage
> them in order to maximise efficiency.
>
>> There is no layout in the spec, there are only layout_segments. Actually
>> what we call layout_segments, in the spec, it is simply called a layout.
>>
>> The client asks for a layout (segment) and gets one. An ~0 length one
>> is just a special case. Without layout_get (segment) there is no optional
>> pnfs support.
>>
>> So we are reading two different specs because to me it clearly says
>> layout - which is a segment.
>>
>> Because the way I read it the pNFS is optional in 4.1. But if I'm a
>> pNFS client I need to expect layouts (segments)
>>
>>> IOW: I'm not even remotely interested in your "everyday problems" if
>>> there are no "everyday solutions" that actually fit the generic can of
>>> spec worms that the pNFS layout segments open.
>>
>> That I don't understand. What "spec worms that the pNFS layout segments open"
>> Are you seeing. Because it works pretty simple for me. And I don't see the
>> big difference for files. One thing I learned for the past is that when you
>> have concerns I should understand them and start to address them. Because
>> your insights are usually on the Money. If you are concerned then there is
>> something I should fix.
>
> I'm saying that if I need to manage layouts that deal with >1000 DSes,
> then I presumably need a strategy for ensuring that I return/forget
> segments that are no longer needed, and I need a strategy for ensuring
> that I always hold the segments that I do need; otherwise, I could just
> ask for a full-file layout and deal with the 1000 DSes (which is what we
> do today)...
How about LRU based caching to start with?
>
> My problem is that the spec certainly doesn't give me any guidance as to
> such a strategy, and I haven't seen anybody else step up to the plate.
> In fact, I strongly suspect that such a strategy is going to be very
> application specific.
The spec doesn't give much guidance to the client on data cache replacement
algorithms either, and still we cache data on the client and
do our best to accommodate the application's needs.
>
> IOW: I don't accept that a layout-segment based solution is useful
> without some form of strategy for telling me which segments to keep and
> which to throw out when I start hitting client resource limits. I also
> haven't seen any strategy out there for setting loga_length (as opposed
> to loga_minlength) in the LAYOUTGET requests: as far as I know that is
> going to be heavily application-dependent in the 1000-DS world.
>
My approach has always been: the client should ask for what it knows about,
and the server may optimize over it. If the client can anticipate the
application behavior, a la sequential read-ahead, it can attempt to
use that, but the server has better knowledge of the entire cluster workload
to determine the appropriate layout segment range.
Benny
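As a concrete strawman for the LRU-based caching idea mentioned above, a minimal sketch
is below. It is purely illustrative: none of these structures or helpers exist in the
current client (only the kernel list helpers are real), and a real implementation would
hook into the existing lseg reference counting and LAYOUTRETURN paths.

/* Illustrative only: keep cached lsegs on an LRU list and trim the oldest
 * unused ones when a soft limit is exceeded. */
struct cached_lseg {
	struct list_head lru_node;	/* linkage on the LRU list */
	bool		 in_use;	/* IO currently in flight through it */
	/* ... the actual pnfs_layout_segment would hang off here ... */
};

struct lseg_lru {
	struct list_head list;		/* least recently used at the head */
	unsigned long	 nr;		/* number of cached segments */
	unsigned long	 limit;		/* soft cap before trimming starts */
};

static void lseg_lru_touch(struct lseg_lru *lru, struct cached_lseg *cl)
{
	/* move to the tail on every use so the head stays coldest */
	list_move_tail(&cl->lru_node, &lru->list);
}

static void lseg_lru_trim(struct lseg_lru *lru)
{
	struct cached_lseg *cl, *tmp;

	list_for_each_entry_safe(cl, tmp, &lru->list, lru_node) {
		if (lru->nr <= lru->limit)
			break;
		if (cl->in_use)			/* never drop a segment with IO in flight */
			continue;
		list_del_init(&cl->lru_node);
		lru->nr--;
		return_or_forget_lseg(cl);	/* hypothetical helper: LAYOUTRETURN or forget */
	}
}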
Let me clarify: there are file-based servers, our Ceph on Ganesha server is one, which have file allocation that is not satisfied by whole-file layouts. I would think that demonstrating this would be sufficient to get support from the Linux client for appropriate segment management, at any rate if someone is willing to write and support the required code, or already has. One of those alternatives is certainly the case. By the way, we wrote generic pNFS and pNFS files support for Ganesha and, with a big dose of help from Panasas, are taking it to merge.
Matt
----- "Matt W. Benjamin" <[email protected]> wrote:
> That would be pretty disappointing. However, based on previous
> interactions, my belief would be, the
> Linux client will do what can be shown empirically to work better, or
> more correctly.
>
> Matt
>
> ----- "Trond Myklebust" <[email protected]> wrote:
>
> > On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> > > On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> > > >> Also Files when they will support segments and servers that request segments,
> > > >> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> > > >> you know you want to write.
> > > >
> > > > Why would we want to add segment support to the pNFS files client???
> > > > Segments are a nuisance that multiply the amount of unnecessary chitchat
> > > > between the client and the MDS without providing any tangible
> > > > benefits...
> > > >
> > >
> > > Your kidding right?
> > >
> > > One: it is mandated by the Standard, This is not an option. So a perfectly
> > > Standard complaint server is not Supported by Linux because we don't see
> > > the point.
> >
> > Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
> > give us a full layout, then we fall back to write through MDS. Why dick
> > around with crap that SLOWS YOU DOWN.
> >
> > > Two: There are already file-layout servers out there (multiple) which are
> > > waiting for the Linux files-layout segment support, because the underline
> > > FS requires Segments and now they do not work with the Linux client. These
> > > are CEPH and GPFS and more.
> >
> > Then they will have a _long_ wait....
> >
> > Trond
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer
> >
> > NetApp
> > [email protected]
> > http://www.netapp.com
>
> --
>
> Matt Benjamin
>
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI 48104
>
> http://linuxbox.com
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
On 11/29/2011 04:08 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 15:49 -0800, Marc Eshel wrote:
>> The only 1 cent that I can add to this argument is that I was led to
>> believe by you and others that Linux kernel don't add functionality on the
>> client side that is not supported on the server side. Last time I made
>> this point I was told that it is ok if they are of by a version or two.
>> You made it clear that this is no longer true and that the Linux client
>> and server are now independent of each other. I spent time working on the
>> server side believing that the client and server will progress more or
>> less together, disappointed to find out that it is not the case any more.
>> Marc.
>
> I don't know how to manage layout segments in a way that meets the pNFS
> goals of scalability and performance, and neither you nor the pNFS spec
> have told me how to do this.
This I do not understand. What is the problem with the current implementation?
I use these segments every day and see performance through the roof. I'm
able to saturate any cluster I throw at it and am happy as bliss. And it's
surprisingly stable. My IO flow is full of segments and recalls, and it
works surprisingly well.
(I intend to send some hard numbers next week. But believe me, they are
amazing.)
>
> If IBM wants a client that implements layout segments, then it is _your_
> responsibility to:
> A. Convince me that it is an implementable part of the spec.
Sure
> B. Provide me with an implementation that deals all with the
> concerns that I have.
>
What concerns are those? If it's the COMMITs then I think I know what
to do.
> IOW: nobody has ever promised IBM that the community would do all your
> client work for you. EMC, NetApp and Panasas have implemented (most of)
> the bits that they cared about; any bits that they haven't implemented
> and that you care about are up to you.
>
I think I'll pick this up. Me or some interested people I know. Until now
it was said that Fred is working on that and we waited patiently for him
to do it out of respect and to save any wasted efforts. But if it's dropped
on the floor, then I'm all the gladder to pick it up. Just give me the green
light because I do not want to duplicate efforts.
Thanks
Heart
That would be pretty disappointing. However, based on previous interactions, my belief would be that the
Linux client will do what can be shown empirically to work better, or more correctly.
Matt
----- "Trond Myklebust" <[email protected]> wrote:
> On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> > On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> > >> Also Files when they will support segments and servers that request segments,
> > >> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> > >> you know you want to write.
> > >
> > > Why would we want to add segment support to the pNFS files client???
> > > Segments are a nuisance that multiply the amount of unnecessary chitchat
> > > between the client and the MDS without providing any tangible
> > > benefits...
> > >
> >
> > Your kidding right?
> >
> > One: it is mandated by the Standard, This is not an option. So a perfectly
> > Standard complaint server is not Supported by Linux because we don't see
> > the point.
>
> Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
> give us a full layout, then we fall back to write through MDS. Why dick
> around with crap that SLOWS YOU DOWN.
>
> > Two: There are already file-layout servers out there (multiple) which are
> > waiting for the Linux files-layout segment support, because the underline
> > FS requires Segments and now they do not work with the Linux client. These
> > are CEPH and GPFS and more.
>
> Then they will have a _long_ wait....
>
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
On Wed, Nov 30, 2011 at 12:40 AM, Trond Myklebust
<[email protected]> wrote:
> On Fri, 2011-12-02 at 20:52 -0800, Peng Tao wrote:
>> Asking for layout in pg_init will always make client ask for only 4KB
>> layout in every layoutget. This way, client drops the IO size information
>> that is meaningful for MDS in handing out layout.
>>
>> In stead, if layout is not find in cache, do not send layoutget
>> at once. Wait until before issuing IO in pnfs_do_multiple_reads/writes
>> because that is where we know the real size of current IO. By telling the
>> real IO size to MDS, MDS will have a better chance to give proper layout.
>
> Why can't you just split pnfs_update_layout() into 2 sub-functions
> instead of duplicating it in private block code?
Because I wanted to differentiate between no layout header and no
cached lseg, where the pnfs_update_layout() interface is not enough to
tell the difference. Of course I can put these all into the generic layer.
I will update the patchset to do it.
>
> Then call layoutget in your pg_doio() callback instead of adding a
> redundant pnfs_update_layout to
> pnfs_do_multiple_reads/pnfs_do_multiple_writes...
I have considered it before, but using a private pg_doio() means we will
have just as much duplication of pnfs_generic_pg_read/writepages.
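[For reference, the suggestion could be as small as the sketch below. This is
illustrative only: it assumes the .pg_doio() callback and pnfs_generic_pg_writepages()
prototypes of this kernel generation, takes the open context from the first coalesced
request, and ignores the failure bookkeeping a real patch would need.]

/* Sketch of a blocklayout-private .pg_doio: grab a layout sized to the
 * coalesced IO, then reuse the existing generic write-out path. */
static int bl_pg_writepages(struct nfs_pageio_descriptor *desc)
{
	if (!desc->pg_lseg && !list_empty(&desc->pg_list)) {
		struct nfs_page *req = nfs_list_entry(desc->pg_list.next);

		/* desc->pg_count is the number of bytes coalesced so far */
		desc->pg_lseg = pnfs_update_layout(desc->pg_inode,
						   req->wb_context,
						   req_offset(req),
						   desc->pg_count,
						   IOMODE_RW, GFP_NOFS);
	}
	return pnfs_generic_pg_writepages(desc);
}

[The bl_pg_write_ops table would then point .pg_doio at such a wrapper instead of at
pnfs_generic_pg_writepages directly.]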
>
>
>> Signed-off-by: Peng Tao <[email protected]>
>> ---
>> fs/nfs/blocklayout/blocklayout.c | 54 ++++++++++++++++++++++++++++++++++++-
>> 1 files changed, 52 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
>> index 48cfac3..fd585fe 100644
>> --- a/fs/nfs/blocklayout/blocklayout.c
>> +++ b/fs/nfs/blocklayout/blocklayout.c
>> @@ -39,6 +39,7 @@
>> #include <linux/prefetch.h>
>>
>> #include "blocklayout.h"
>> +#include "../internal.h"
>>
>> #define NFSDBG_FACILITY NFSDBG_PNFS_LD
>>
>> @@ -990,14 +991,63 @@ bl_clear_layoutdriver(struct nfs_server *server)
>> return 0;
>> }
>>
>> +/* While RFC doesn't limit maximum size of layout, we better limit it ourself. */
>> +#define PNFSBLK_MAXRSIZE (0x1<<22)
>> +#define PNFSBLK_MAXWSIZE (0x1<<21)
>> +static void
>> +bl_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
>> +{
>> + struct inode *ino = pgio->pg_inode;
>> + struct pnfs_layout_hdr *lo;
>> +
>> + BUG_ON(pgio->pg_lseg != NULL);
>> + spin_lock(&ino->i_lock);
>> + lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_KERNEL);
>
> This has never been tested... It contains all sorts of bugs from
> recursive attempts to take the ino->i_lock, to sleep-under-spinlock...
The code is certainly tested... If you look into
pnfs_find_alloc_layout(), you'll see the spinlock is released inside
pnfs_find_alloc_layout() when it needs to sleep there.
pnfs_find_alloc_layout() actually asserts spin_locked at function
entrance... So we have to take the spin lock before calling it...
static struct pnfs_layout_hdr *
pnfs_find_alloc_layout(struct inode *ino,
		       struct nfs_open_context *ctx,
		       gfp_t gfp_flags)
{
	struct nfs_inode *nfsi = NFS_I(ino);
	struct pnfs_layout_hdr *new = NULL;

	dprintk("%s Begin ino=%p layout=%p\n", __func__, ino, nfsi->layout);

	assert_spin_locked(&ino->i_lock);
	if (nfsi->layout) {
		if (test_bit(NFS_LAYOUT_DESTROYED, &nfsi->layout->plh_flags))
			return NULL;
		else
			return nfsi->layout;
	}
	spin_unlock(&ino->i_lock);
	new = alloc_init_layout_hdr(ino, ctx, gfp_flags);
	spin_lock(&ino->i_lock);

	if (likely(nfsi->layout == NULL))	/* Won the race? */
		nfsi->layout = new;
	else
		pnfs_free_layout_hdr(new);
	return nfsi->layout;
}
--
Thanks,
Tao
>
>> + if (!lo || test_bit(lo_fail_bit(IOMODE_READ), &lo->plh_flags)) {
>> + spin_unlock(&ino->i_lock);
>> + nfs_pageio_reset_read_mds(pgio);
>> + return;
>> + }
>> +
>> + pgio->pg_bsize = PNFSBLK_MAXRSIZE;
>> + pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
>> + req_offset(req),
>> + req->wb_bytes,
>> + IOMODE_READ);
>> + spin_unlock(&ino->i_lock);
>> +}
>> +
>> +static void
>> +bl_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
>> +{
>> + struct inode *ino = pgio->pg_inode;
>> + struct pnfs_layout_hdr *lo;
>> +
>> + BUG_ON(pgio->pg_lseg != NULL);
>> + spin_lock(&ino->i_lock);
>> + lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_NOFS);
>> + if (!lo || test_bit(lo_fail_bit(IOMODE_RW), &lo->plh_flags)) {
>> + spin_unlock(&ino->i_lock);
>> + nfs_pageio_reset_write_mds(pgio);
>> + return;
>> + }
>
> Ditto...
>
>> +
>> + pgio->pg_bsize = PNFSBLK_MAXWSIZE;
>> + pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
>> + req_offset(req),
>> + req->wb_bytes,
>> + IOMODE_RW);
>> + spin_unlock(&ino->i_lock);
>> +}
>> +
>> static const struct nfs_pageio_ops bl_pg_read_ops = {
>> - .pg_init = pnfs_generic_pg_init_read,
>> + .pg_init = bl_pg_init_read,
>> .pg_test = pnfs_generic_pg_test,
>> .pg_doio = pnfs_generic_pg_readpages,
>> };
>>
>> static const struct nfs_pageio_ops bl_pg_write_ops = {
>> - .pg_init = pnfs_generic_pg_init_write,
>> + .pg_init = bl_pg_init_write,
>> .pg_test = pnfs_generic_pg_test,
>> .pg_doio = pnfs_generic_pg_writepages,
>> };
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>
On Tue, 2011-11-29 at 15:49 -0800, Marc Eshel wrote:
> The only 1 cent that I can add to this argument is that I was led to
> believe by you and others that Linux kernel don't add functionality on the
> client side that is not supported on the server side. Last time I made
> this point I was told that it is ok if they are of by a version or two.
> You made it clear that this is no longer true and that the Linux client
> and server are now independent of each other. I spent time working on the
> server side believing that the client and server will progress more or
> less together, disappointed to find out that it is not the case any more.
> Marc.
I don't know how to manage layout segments in a way that meets the pNFS
goals of scalability and performance, and neither you nor the pNFS spec
have told me how to do this.
If IBM wants a client that implements layout segments, then it is _your_
responsibility to:
A. Convince me that it is an implementable part of the spec.
B. Provide me with an implementation that deals with all the
concerns that I have.
IOW: nobody has ever promised IBM that the community would do all your
client work for you. EMC, NetApp and Panasas have implemented (most of)
the bits that they cared about; any bits that they haven't implemented
and that you care about are up to you.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 11/29/2011 03:30 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
>>
>> The kind of typologies I'm talking about a single layout get ever 1GB is
>> marginal to the gain I get in deploying 100 of DSs. I have thousands of
>> DSs I want to spread the load evenly. I'm limited by the size of the layout
>> (Device info in the case of files) So I'm limited by the number of DSs I can
>> have in a layout. For large files these few devices become an hot spot all
>> the while the rest of the cluster is idle.
>
> I call "bullshit" on that whole argument...
>
> You've done sod all so far to address the problem of a client managing
sod? I don't know this word?
> layout segments for a '1000 DS' case. Are you expecting that all pNFS
> object servers out there are going to do that for you? How do I assume
> that a generic pNFS files server is going to do the same? As far as I
> know, the spec is completely moot on the whole subject.
>
What? The whole segments thing is in the generic part of the spec and is not
at all specific to, or even specified in, the objects and blocks RFCs.
There is no "layout" in the spec, there are only layout segments. Actually,
what we call layout_segments is, in the spec, simply called a layout.
The client asks for a layout (segment) and gets one. An ~0 length one
is just a special case. Without layout_get (of a segment) there is no optional
pNFS support.
So we are reading two different specs, because to me it clearly says
layout - which is a segment.
Because the way I read it, pNFS is optional in 4.1. But if I'm a
pNFS client I need to expect layouts (segments).
> IOW: I'm not even remotely interested in your "everyday problems" if
> there are no "everyday solutions" that actually fit the generic can of
> spec worms that the pNFS layout segments open.
That I don't understand. What "spec worms that the pNFS layout segments open"
are you seeing? Because it works pretty simply for me. And I don't see the
big difference for files. One thing I learned from the past is that when you
have concerns I should understand them and start to address them, because
your insights are usually on the money. If you are concerned then there is
something I should fix.
Fred told me about his COMMIT problem. And in my opinion he is doing it
the wrong way.
He is trying to consolidate commits across lo_segments. But on the contrary,
he should keep it as is and bound the commit to segment boundaries.
If a server is giving out all-file layouts then the above problem is moot,
but if the Server gave out segments, then for him he anticipates all operations
segmented. Usually the servers I saw will always have different DSs on these other
segments and the hard work will not matter at all. If not, then such a server will
anticipate that the COMMITs will be finer than what could theoretically be done,
and could take that into account. In any way, that's what they'll get.
So please don't solve for the theoretical case that no one uses. (The same DSs repeat
in the next segments.)
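[To make the per-segment-commit suggestion concrete, here is a minimal sketch.
It is invented for illustration: the bucket structure and queue_for_commit() are
not the client's real commit machinery; only nfs_list_add_request() and the NFS
page/lseg types are existing names.]

/* Illustrative: instead of one global commit list, keep one bucket per
 * layout segment and issue a COMMIT per bucket, so a commit never spans
 * a segment boundary. */
struct segment_commit_bucket {
	struct pnfs_layout_segment *lseg;	/* segment this bucket belongs to */
	struct list_head	    written;	/* requests waiting for COMMIT */
};

static void queue_for_commit(struct segment_commit_bucket *bucket,
			     struct nfs_page *req)
{
	/* a request is only ever queued on the bucket of the lseg it was
	 * written through, so the later COMMIT stays within that segment */
	nfs_list_add_request(req, &bucket->written);
}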
>
>>>> Two: There are already file-layout servers out there (multiple) which are
>>>> waiting for the Linux files-layout segment support, because the underline
>>>> FS requires Segments and now they do not work with the Linux client. These
>>>> are CEPH and GPFS and more.
>>>
>>> Then they will have a _long_ wait....
>>>
>>
>> OK, so now I understand. Because when I was talking to Fred before BAT and during
>> It was very very peculiar to me why he is not already done with that simple stuff.
>> Because usually Fred is such a brilliant fast programmer that I admire, and that simple
>> crap?
>>
>> But now that explains
>
> Yes. It's all a big conspiracy, and we're deliberately holding Fred's
> genius back in order to disappoint you...
>
My disappointment is that I think it is important for me that the pNFS protocol can
take a stronghold in the HPC and cloud market (and everywhere). And for that to happen,
all layout types, including files, must excel and shine. I'm constantly trying to architect
an HPC-cluster-class parallel filesystem with the files layout as well, but I keep getting hit
by small trivialities that make all the difference. (Another example is that EOF talk
we had the other day.)
I'm a patient guy; I can wait until you guys see the light.
Thanks
Heart
On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
> You ignored my main point, I was talking about the server side, my point
> was that there is nothing to build on on the serve side since the pNFS
> Linux server is not happening.
> Marc.
Sorry. I misunderstood your concern. As far as I know, the main problem
there is also one of investment: nobody has stepped up to help Bruce
write a pNFS server.
I'm less worried about this now than I was earlier, because other open
source efforts are gaining traction (see Ganesha - which is being
sponsored by IBM, and projects such as Tigran's java based pNFS server).
The other point is that we've developed client test-rigs that don't
depend on the availability of a Linux server (newpynfs and the pynfs
based proxy).
Cheers
Trond
> From: Trond Myklebust <[email protected]>
> To: Marc Eshel <[email protected]>
> Cc: Peng Tao <[email protected]>, [email protected], Boaz Harrosh <[email protected]>, Garth Gibson <[email protected]>, Fred Isaman <[email protected]>, [email protected], Matt Benjamin <[email protected]>
> Date: 11/29/2011 04:08 PM
> Subject: Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
>
>
>
> On Tue, 2011-11-29 at 15:49 -0800, Marc Eshel wrote:
> > The only 1 cent that I can add to this argument is that I was led to
> > believe by you and others that Linux kernel don't add functionality on the
> > client side that is not supported on the server side. Last time I made
> > this point I was told that it is ok if they are of by a version or two.
> > You made it clear that this is no longer true and that the Linux client
> > and server are now independent of each other. I spent time working on the
> > server side believing that the client and server will progress more or
> > less together, disappointed to find out that it is not the case any more.
> > Marc.
>
> I don't know how to manage layout segments in a way that meets the pNFS
> goals of scalability and performance, and neither you nor the pNFS spec
> have told me how to do this.
>
> If IBM wants a client that implements layout segments, then it is _your_
> responsibility to:
> A. Convince me that it is an implementable part of the spec.
> B. Provide me with an implementation that deals all with the
> concerns that I have.
>
> IOW: nobody has ever promised IBM that the community would do all your
> client work for you. EMC, NetApp and Panasas have implemented (most of)
> the bits that they cared about; any bits that they haven't implemented
> and that you care about are up to you.
>
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Nov 29, 2011 at 04:52:59PM -0800, Marc Eshel wrote:
> Trond Myklebust <[email protected]> wrote on 11/29/2011 04:37:05
> PM:
> >
> > Peng Tao, bhalevy, Boaz Harrosh, Garth Gibson, Fred Isaman, linux-
> > nfs, Matt Benjamin
> >
> > On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
> > > You ignored my main point, I was talking about the server side, my
> point
> > > was that there is nothing to build on on the serve side since the pNFS
>
> > > Linux server is not happening.
> > > Marc.
> >
> > Sorry. I misunderstood your concern. As far as I know, the main problem
> > there is also one of investment: nobody has stepped up to help Bruce
> > write a pNFS server.
> >
> > I'm less worried about this now than I was earlier, because other open
> > source efforts are gaining traction (see Ganesha - which is being
> > sponsored by IBM, and projects such as Tigran's java based pNFS server).
> > The other point is that we've developed client test-rigs that don't
> > depend on the availability of a Linux server (newpynfs and the pynfs
> > based proxy).
>
> You got it backward, Ganesha is getting traction precisely because the
> Linux kernel server is not happening :)
My understanding is the same as Trond's--the reason it's not happening
is because nobody is making an effort to merge it. What am I missing?
--b.
On 11/29/2011 04:58 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 16:24 -0800, Boaz Harrosh wrote:
>> On 11/29/2011 03:30 PM, Trond Myklebust wrote:
>>> On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
>>
>> That I don't understand. What "spec worms that the pNFS layout segments open"
>> Are you seeing. Because it works pretty simple for me. And I don't see the
>> big difference for files. One thing I learned for the past is that when you
>> have concerns I should understand them and start to address them. Because
>> your insights are usually on the Money. If you are concerned then there is
>> something I should fix.
>
> I'm saying that if I need to manage layouts that deal with >1000 DSes,
> then I presumably need a strategy for ensuring that I return/forget
> segments that are no longer needed, and I need a strategy for ensuring
> that I always hold the segments that I do need; otherwise, I could just
> ask for a full-file layout and deal with the 1000 DSes (which is what we
> do today)...
>
Thanks for asking, because now I can answer you and you will find that I'm
one step ahead on some of the issues.
1. The 1000-DSes problem is separate from the segments problem. The devices
solution is on the way. The device cache is all but ready for a
periodic scan that throws out devices with 0 users (see the sketch after
this list). We never got to it because currently everyone is testing with
up to 10 devices and I'm using up to
128 devices, which is just fine. The load is marginal so far.
But I promise you it is right here on my to-do list, after some more
pressing problems.
Let's say one thing: this subsystem is the same regardless of whether the
1000 devices are referenced by 1 segment or by 10 segments. Actually, if
by 10, then I might get rid of some and free devices.
2. The many-segments problem. There are not that many. It's more or less
a segment for every 2GB, so an lo_seg struct for so much IO is not
noticeable.
At the upper bound we do not have any problem, because once the system is
out of memory it will start to evict inodes, and on evict we just return
them. Also, for ROC servers we forget them on close. So far all our combined
testing did not show any real memory pressure caused by that. When shown, we
can start discarding segs in an LRU fashion. All the mechanics
to do that are there; we only need to see the need.
3. The current situation is fine and working and showing great performance
for objects and blocks. And it is all in the generic part, so it should just
be the same for files. I do not see any difference.
The only BUG I see is the COMMIT, and I think we know how to fix that.
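[A sketch of the kind of periodic scan point 1 refers to. Nothing here is existing
code: the cached_device structure, the cache list/lock, and free_cached_device()
are stand-ins, and a real implementation would live behind the deviceid cache's
own locking and work infrastructure.]

/* Illustrative reaper: a delayed work item walks the device cache and
 * frees entries that no layout segment references any more. */
struct cached_device {
	struct list_head node;		/* linkage in the global device cache */
	atomic_t	 refcount;	/* taken by every lseg that uses it */
	/* ... device id, DS addresses, etc. ... */
};

static LIST_HEAD(device_cache);
static DEFINE_SPINLOCK(device_cache_lock);

static void device_cache_reap(struct work_struct *work)
{
	struct cached_device *dev, *tmp;
	LIST_HEAD(reap_list);

	spin_lock(&device_cache_lock);
	list_for_each_entry_safe(dev, tmp, &device_cache, node)
		if (atomic_read(&dev->refcount) == 0)	/* unused by any lseg */
			list_move(&dev->node, &reap_list);
	spin_unlock(&device_cache_lock);

	list_for_each_entry_safe(dev, tmp, &reap_list, node) {
		list_del(&dev->node);
		free_cached_device(dev);	/* hypothetical helper */
	}
	/* a DECLARE_DELAYED_WORK() wrapper would re-arm this on an interval */
}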
> My problem is that the spec certainly doesn't give me any guidance as to
> such a strategy, and I haven't seen anybody else step up to the plate.
> In fact, I strongly suspect that such a strategy is going to be very
> application specific.
>
You never asked. I'm thinking about these things all the time. Currently
we are far behind the limits of a running system. I think I'm going to
get to these limits before anyone else.
My strategy is stated above: LRU for devices is almost all there, ref-counting
and all; only the periodic timer needs to be added.
LRU for segments is more work, but is doable. But the segment counts are
so low that we will not hit that problem for a long time. Before I ship
a system that will break that barrier I'll send a fix, I promise.
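[Editorial sketch] To make the eviction idea above concrete, here is a minimal standalone C sketch (not kernel code) of the kind of periodic scan described: walk a device cache and free every entry whose refcount has dropped to zero. The dev_entry/devcache_scan names are illustrative stand-ins, not the actual NFS deviceid cache API.

/* Illustrative sketch only: a periodic scan that evicts device-cache
 * entries with zero users. Names are hypothetical stand-ins. */
#include <stdio.h>
#include <stdlib.h>

struct dev_entry {
	struct dev_entry *next;
	unsigned long id;
	int refcount;		/* number of lsegs currently referencing it */
};

/* Walk the list and free every entry that no layout segment references. */
static void devcache_scan(struct dev_entry **head)
{
	struct dev_entry **pp = head;

	while (*pp) {
		struct dev_entry *de = *pp;
		if (de->refcount == 0) {
			*pp = de->next;		/* unlink the unused device */
			printf("evicting device %lu\n", de->id);
			free(de);
		} else {
			pp = &de->next;
		}
	}
}

int main(void)
{
	struct dev_entry *head = NULL;
	for (unsigned long i = 0; i < 5; i++) {
		struct dev_entry *de = calloc(1, sizeof(*de));
		de->id = i;
		de->refcount = (int)(i % 2);	/* pretend odd ids are still in use */
		de->next = head;
		head = de;
	}
	devcache_scan(&head);			/* would run from a periodic timer */
	return 0;
}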
> IOW: I don't accept that a layout-segment based solution is useful
> without some form of strategy for telling me which segments to keep and
> which to throw out when I start hitting client resource limits.
LRU. Again, there are not more than a few segments per inode. It's not
1000, like devices.
> I also
> haven't seen any strategy out there for setting loga_length (as opposed
> to loga_minlength) in the LAYOUTGET requests: as far as I know that is
> going to be heavily application-dependent in the 1000-DS world.
>
The current situation is working for me. But we are also actively working to
improve it. What we want is for files-LO to enjoy the same privileges that
objects and blocks already have, in exactly the same simple, stupid, but working,
way.
All your above concerns are true and interesting. I call them rich man's problems.
But they are not specific to files-LO; they are generic to all of us. The current
situation satisfies us for blocks and objects. The file guys out there are jealous.
Thanks
Heart
On Tue, 2011-11-29 at 13:50 -0800, Boaz Harrosh wrote:
> On 11/29/2011 01:34 PM, Boaz Harrosh wrote:
> > But just do the above and you'll see that it is perfect.
> >
> > BTW don't limit the lo_segment size by the max_io_size. This is why you
> > have .bg_test to signal when IO is maxed out.
> >
> > - The read segments should be as big as possible (i_size long)
> > - The Write segments should ideally be as big as the Application
> > wants to write to. (Amount of dirty pages at time of nfs-write-out
> > is a very good first approximation).
> >
> > So I guess it is: I hate these patches, to much mess, too little goodness.
> >
> > Thank
> > Boaz
> >
>
> Ho and one more thing.
>
> Also Files when they will support segments and servers that request segments,
> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> you know you want to write.
Why would we want to add segment support to the pNFS files client???
Segments are a nuisance that multiply the amount of unnecessary chitchat
between the client and the MDS without providing any tangible
benefits...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, 2011-11-29 at 19:48 -0500, Matt W. Benjamin wrote:
> Let me clarify: there are files based servers, our Ceph on Ganesha server is one, which have file allocation not satisfied by whole-file layouts. I would think that demonstrating this would be sufficient to get support from the Linux client to support appropriate segment management, at any rate, if someone is willing to write and support the required code, or already has. One of those alternatives is certainly the case. By the way, we wrote generic pNFS, pNFS files support for Ganesha and, with a big dose of help from Panasas, are taking it to merge.
I really want more than that. Please see the reply that I just sent to
Boaz: I need a client strategy for managing partial layout segments in
the case where holding a whole-file layout is not acceptable. Otherwise,
what we have now should be sufficient...
Trond
> Matt
>
> ----- "Matt W. Benjamin" <[email protected]> wrote:
>
> > That would be pretty disappointing. However, based on previous
> > interactions, my belief would be, the
> > Linux client will do what can be shown empirically to work better, or
> > more correctly.
> >
> > Matt
> >
> > ----- "Trond Myklebust" <[email protected]> wrote:
> >
> > > On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> > > > On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> > > > >> Also Files when they will support segments and servers that
> > > request segments,
> > > > >> like the CEPH server, will very much enjoy the above, .i.e:
> > Tell
> > > me the amount
> > > > >> you know you want to write.
> > > > >
> > > > > Why would we want to add segment support to the pNFS files
> > > client???
> > > > > Segments are a nuisance that multiply the amount of unnecessary
> > > chitchat
> > > > > between the client and the MDS without providing any tangible
> > > > > benefits...
> > > > >
> > > >
> > > > Your kidding right?
> > > >
> > > > One: it is mandated by the Standard, This is not an option. So a
> > > perfectly
> > > > Standard complaint server is not Supported by Linux because
> > we
> > > don't see
> > > > the point.
> > >
> > > Bollocks.. Nothing is "mandated by the Standard". If the server
> > > doesn't
> > > give us a full layout, then we fall back to write through MDS. Why
> > > dick
> > > around with crap that SLOWS YOU DOWN.
> > >
> > > > Two: There are already file-layout servers out there (multiple)
> > > which are
> > > > waiting for the Linux files-layout segment support, because
> > the
> > > underline
> > > > FS requires Segments and now they do not work with the Linux
> > > client. These
> > > > are CEPH and GPFS and more.
> > >
> > > Then they will have a _long_ wait....
> > >
> > > Trond
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer
> > >
> > > NetApp
> > > [email protected]
> > > http://www.netapp.com
> >
> > --
> >
> > Matt Benjamin
> >
> > The Linux Box
> > 206 South Fifth Ave. Suite 150
> > Ann Arbor, MI 48104
> >
> > http://linuxbox.com
> >
> > tel. 734-761-4689
> > fax. 734-769-8938
> > cel. 734-216-5309
>
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
This gives the LD the option not to ask for a layout in pg_init.
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 46 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 734e670..c8dc0b1 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1254,6 +1254,7 @@ pnfs_do_multiple_writes(struct nfs_pageio_descriptor *desc, struct list_head *he
struct nfs_write_data *data;
const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ const bool has_lseg = !!lseg;
desc->pg_lseg = NULL;
while (!list_empty(head)) {
@@ -1262,7 +1263,29 @@ pnfs_do_multiple_writes(struct nfs_pageio_descriptor *desc, struct list_head *he
data = list_entry(head->next, struct nfs_write_data, list);
list_del_init(&data->list);
+ if (!has_lseg) {
+ struct nfs_page *req = nfs_list_entry(data->pages.next);
+ __u64 length = data->npages << PAGE_CACHE_SHIFT;
+
+ lseg = pnfs_update_layout(desc->pg_inode,
+ req->wb_context,
+ req_offset(req),
+ length,
+ IOMODE_RW,
+ GFP_NOFS);
+ if (!lseg || length > (lseg->pls_range.length)) {
+ put_lseg(lseg);
+ lseg = NULL;
+ pnfs_write_through_mds(desc, data);
+ continue;
+ }
+ }
+
trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
+ if (!has_lseg) {
+ put_lseg(lseg);
+ lseg = NULL;
+ }
if (trypnfs == PNFS_NOT_ATTEMPTED)
pnfs_write_through_mds(desc, data);
}
@@ -1350,6 +1373,7 @@ pnfs_do_multiple_reads(struct nfs_pageio_descriptor *desc, struct list_head *hea
struct nfs_read_data *data;
const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ const bool has_lseg = !!lseg;
desc->pg_lseg = NULL;
while (!list_empty(head)) {
@@ -1358,7 +1382,29 @@ pnfs_do_multiple_reads(struct nfs_pageio_descriptor *desc, struct list_head *hea
data = list_entry(head->next, struct nfs_read_data, list);
list_del_init(&data->list);
+ if (!has_lseg) {
+ struct nfs_page *req = nfs_list_entry(data->pages.next);
+ __u64 length = data->npages << PAGE_CACHE_SHIFT;
+
+ lseg = pnfs_update_layout(desc->pg_inode,
+ req->wb_context,
+ req_offset(req),
+ length,
+ IOMODE_READ,
+ GFP_KERNEL);
+ if (!lseg || length > lseg->pls_range.length) {
+ put_lseg(lseg);
+ lseg = NULL;
+ pnfs_read_through_mds(desc, data);
+ continue;
+ }
+ }
+
trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
+ if (!has_lseg) {
+ put_lseg(lseg);
+ lseg = NULL;
+ }
if (trypnfs == PNFS_NOT_ATTEMPTED)
pnfs_read_through_mds(desc, data);
}
--
1.7.1.262.g5ef3d
On 11/29/2011 04:37 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
>> You ignored my main point, I was talking about the server side, my point
>> was that there is nothing to build on on the serve side since the pNFS
>> Linux server is not happening.
>> Marc.
>
> Sorry. I misunderstood your concern. As far as I know, the main problem
> there is also one of investment: nobody has stepped up to help Bruce
> write a pNFS server.
>
I use "OUR" Linux pNFS Server every day, all day. It's there it is written
and it kicks us. We've been working on that for 5 Years now.
The only holding factor is that no one wants to do the final step and
fight the VFS guys to actually push it into Linus.
But it is there and alive in full open-source glory in Benny's tree.
> I'm less worried about this now than I was earlier, because other open
> source efforts are gaining traction (see Ganesha - which is being
> sponsored by IBM, and projects such as Tigran's java based pNFS server).
> The other point is that we've developed client test-rigs that don't
> depend on the availability of a Linux server (newpynfs and the pynfs
> based proxy).
>
For me I'm testing with the linux Server and it is just there and works
well for me. Better than any other solution I have right now.
> Cheers
> Trond
>
Ciao
Heart
On 11/29/2011 06:07 PM, Trond Myklebust wrote:
>>
>> 1. The 1000 DSes problem is separate from the segments problem. The devices
>
> Errr... That was the problem that you used to justify the need for a
> full implementation of layout segments in the pNFS files case...
>
What I do not understand? I said what?
>> solution is on the way. The device cache is all but ready to see some
>> periodic scan that throws 0 used devices. We never got to it because
>> currently every one is testing with up to 10 devices and I'm using upto
>> 128 devices which is just fine. The load is marginal so far.
>> But I promise you it is right here on my to do list. After some more
>> pressed problem.
>> Lets say one thing this subsystem is the same regardless of if the
>> 1000 devices are refed by 1 segment or by 10 segments. Actually if
>> by 10 then I might get rid of some and free devices.
>>
>> 2. The many segments problem. There are not that many. It's more less
>> a segment for every 2GB so an lo_seg struct for so much IO is not
>> noticeable.
>
> Where do you get that 2GB number from?
>
It's just the numbers that I saw and used. I'm just giving you an example
usage. The numbers guys are looking for are not a seg for every 4K but a seg
for every gig. That's what I'm saying. When you assess the problem you should
attack the expected and current behavior.
When a smart-ass server comes and serves 4K segments and all its clients go
OOM, how long will that server stay in business? I don't care about him;
I care about a properly set balance, and that is what we arrived at both in
Panasas and elsewhere.
>> At the upper bound we do not have any problem because Once the system is
>> out of memory it will start to evict inodes. And on evict we just return
>> them. Also ROC Servers we forget them on close. So so far all our combined
>> testing did not show any real memory pressure caused by that. When shown we
>> can start discarding segs in an LRU fashion. There are all the mechanics
>> to do that, we only need to see the need.
>
> It's not necessarily that simple: if you are already low on memory, then
> LAYOUTGET and GETDEVICE will require you to allocate more memory in
> order to get round to cleaning those dirty pages.
> There are plenty of situations where the majority of dirty pages belong
> to a single file. If that file is one of your 1000 DS-files and it
> requires you to allocate 1000 new device table entries...
>
No!!! That is the all-file layout problem. In a balanced and segmented
system you don't. You start by getting a small number of devices corresponding
to the first seg, send the IO, and when the IO returns, given memory pressure,
you can free the segment and its ref'ed devices and continue with the next
seg. You can do this all day, visiting all the 1000 devices while never holding
more than 10 at a time.
The ratios are fine. For every 1GB of dirty pages I have one layout and 10
devices. That is a marginal and expected memory need for IO. Should I start with
the block layer, SCSI layer, iSCSI LLD, networking stack? They all need more
memory to clear memory. If the system makes sure that dirty-page pressure
starts soon enough, the system should be fine.
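[Editorial sketch] As a rough illustration of the segment-at-a-time write-out described above, here is a small standalone C sketch. SEG_SIZE, get_segment, do_io and put_segment are hypothetical stand-ins, and the 2GB-per-segment figure is just the example ratio quoted in this thread.

/* Hold one layout segment (and its handful of devices) at a time,
 * never all 1000 devices at once. Illustrative only. */
#include <stdio.h>

#define SEG_SIZE (2ULL << 30)	/* ~2GB per segment, the ratio mentioned above */

static int get_segment(unsigned long long off, unsigned long long len)
{
	printf("LAYOUTGET  [%llu, %llu)\n", off, off + len);
	return 0;	/* pretend we now hold the segment plus ~10 devices */
}

static void do_io(unsigned long long off, unsigned long long len)
{
	printf("write      [%llu, %llu)\n", off, off + len);
}

static void put_segment(unsigned long long off)
{
	printf("put seg at %llu, its devices can now be freed\n", off);
}

int main(void)
{
	unsigned long long dirty = 10ULL << 30;	/* 10GB of dirty data */

	for (unsigned long long off = 0; off < dirty; off += SEG_SIZE) {
		unsigned long long len =
			dirty - off < SEG_SIZE ? dirty - off : SEG_SIZE;
		if (get_segment(off, len))
			break;
		do_io(off, len);
		put_segment(off);	/* under pressure, drop before the next seg */
	}
	return 0;
}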
>> 3. The current situation is fine and working and showing great performance
>> for objects and blocks. And it is all in the Generic part so it should just
>> be the same for files. I do not see any difference.
>>
>> The only BUG I see is the COMMIT and I think we know how to fix that
>
> I haven't seen any performance numbers for either, so I can't comment.
>
890MB from a single 10G client, single stream.
3.6G with 16 clients, N x N, from a 4.0G theoretical storage limit.
Please believe me, these are nice numbers. It is all very balanced: 2G segments,
10 devices each segment. Smooth as silk.
>>
>> LRU. Again there are not more than a few segments per inode. It's not
>> 1000 like devices.
>
> Again, the problem for files shouldn't be the number of segments, it is
> number of devices.
>
Right! And the all-file layout makes it worse. With segments the DSes can
be de-refed early, making room for new devices. It is all a matter of keeping
your numbers balanced. When you get it wrong, your client performance drops.
All we (the client) need to care about is that we don't crash and do the right
thing. If a server returns a 1000-DS segment then we return E-RESOURCE.
Hell, the XDR buffer for GETDEVICEINFO will be much too small long before
that. But if the server returns 10 devices at a time that can be discarded
before the next segment, then we are fine, right?
>>
>> All your above concerns are true and interesting. I call them a rich man problems.
>> But they are not specific to files-LO they are generic to all of us. Current situation
>> satisfies us for blocks and objects. The file guys out there are jealous.
>
> I'm not convinced that the problems are the same. objects, and
> particularly blocks, appear to treat layout segments as a form of byte
> range lock. There is no reason for a pNFS files server to do so.
>
Trond, this is not fair. You are back to your old self again. A files-layout
guy just told you that his cluster's data layout cannot be described in a
single deviceinfo+layout and that his topology requires segments. Locks or
no locks, that's beside the issue.
In objects, what you say is true only for RAID5, because you cannot have
two clients writing the same stripe. But for RAID0 there is no such restriction.
For a long time I served all-file layouts, until I had a system with more than
21 objects; 21 objects is the limit of the layoutget buffer from the client.
So now I serve 10-device segments at a time, which gives me a nice balance,
and it actually works much better than the old all-file way. It is lighter
both on the server implementation and on the client.
You are dodging our problem. There are real servers out there with topologies
that need segments in exactly the kind of numbers that I'm talking about. The
current implementation is just fine. All they want is the restriction lifted and
the COMMIT bug fixed. They do not ask for anything more.
And soon enough I will demonstrate to you a (virtual) 1000-device file working
just fine, once I get that device-cache LRU in place.
Let's say that the RAID0 objects behavior is identical to files-LO, which is
RAID0 only (no recalls on stripe conflicts). So if it works very nicely for
objects, I don't see why it should have problems for files.
If I send you a patch that fixes the COMMIT problem in the files layout,
will you consider it?
Heart
On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
> On 11/29/2011 02:47 PM, Trond Myklebust wrote:
> > On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> >> On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> >>>> Also Files when they will support segments and servers that request segments,
> >>>> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> >>>> you know you want to write.
> >>>
> >>> Why would we want to add segment support to the pNFS files client???
> >>> Segments are a nuisance that multiply the amount of unnecessary chitchat
> >>> between the client and the MDS without providing any tangible
> >>> benefits...
> >>>
> >>
> >> Your kidding right?
> >>
> >> One: it is mandated by the Standard, This is not an option. So a perfectly
> >> Standard complaint server is not Supported by Linux because we don't see
> >> the point.
> >
> > Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
> > give us a full layout, then we fall back to write through MDS. Why dick
> > around with crap that SLOWS YOU DOWN.
> >
>
> NO! MAKE YOU FASTER.
>
> The kind of typologies I'm talking about a single layout get ever 1GB is
> marginal to the gain I get in deploying 100 of DSs. I have thousands of
> DSs I want to spread the load evenly. I'm limited by the size of the layout
> (Device info in the case of files) So I'm limited by the number of DSs I can
> have in a layout. For large files these few devices become an hot spot all
> the while the rest of the cluster is idle.
I call "bullshit" on that whole argument...
You've done sod all so far to address the problem of a client managing
layout segments for a '1000 DS' case. Are you expecting that all pNFS
object servers out there are going to do that for you? How do I assume
that a generic pNFS files server is going to do the same? As far as I
know, the spec is completely moot on the whole subject.
IOW: I'm not even remotely interested in your "everyday problems" if
there are no "everyday solutions" that actually fit the generic can of
spec worms that the pNFS layout segments open.
> >> Two: There are already file-layout servers out there (multiple) which are
> >> waiting for the Linux files-layout segment support, because the underline
> >> FS requires Segments and now they do not work with the Linux client. These
> >> are CEPH and GPFS and more.
> >
> > Then they will have a _long_ wait....
> >
>
> OK, so now I understand. Because when I was talking to Fred before BAT and during
> It was very very peculiar to me why he is not already done with that simple stuff.
> Because usually Fred is such a brilliant fast programmer that I admire, and that simple
> crap?
>
> But now that explains
Yes. It's all a big conspiracy, and we're deliberately holding Fred's
genius back in order to disappoint you...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
You ignored my main point. I was talking about the server side; my point
was that there is nothing to build on on the server side since the pNFS
Linux server is not happening.
Marc.
From:
Trond Myklebust <[email protected]>
To:
Marc Eshel <[email protected]>
Cc:
Peng Tao <[email protected]>, [email protected], Boaz Harrosh
<[email protected]>, Garth Gibson <[email protected]>, Fred Isaman
<[email protected]>, [email protected], Matt Benjamin
<[email protected]>
Date:
11/29/2011 04:08 PM
Subject:
Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
On Tue, 2011-11-29 at 15:49 -0800, Marc Eshel wrote:
> The only 1 cent that I can add to this argument is that I was led to
> believe by you and others that the Linux kernel doesn't add functionality on the
> client side that is not supported on the server side. Last time I made
> this point I was told that it is ok if they are off by a version or two.
> You made it clear that this is no longer true and that the Linux client
> and server are now independent of each other. I spent time working on the
> server side believing that the client and server will progress more or
> less together, disappointed to find out that it is not the case any more.
> Marc.
I don't know how to manage layout segments in a way that meets the pNFS
goals of scalability and performance, and neither you nor the pNFS spec
have told me how to do this.
If IBM wants a client that implements layout segments, then it is _your_
responsibility to:
A. Convince me that it is an implementable part of the spec.
B. Provide me with an implementation that deals with all the
concerns that I have.
IOW: nobody has ever promised IBM that the community would do all your
client work for you. EMC, NetApp and Panasas have implemented (most of)
the bits that they cared about; any bits that they haven't implemented
and that you care about are up to you.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Asking for layout in pg_init will always make the client ask for only a 4KB
layout in every layoutget. This way, the client drops the IO size information
that is meaningful for the MDS in handing out layout.
Instead, if the layout is not found in cache, do not send layoutget
at once. Wait until just before issuing IO in pnfs_do_multiple_reads/writes,
because that is where we know the real size of the current IO. By telling the
real IO size to the MDS, the MDS has a better chance of giving out a proper layout.
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 54 ++++++++++++++++++++++++++++++++++++-
1 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 48cfac3..fd585fe 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -39,6 +39,7 @@
#include <linux/prefetch.h>
#include "blocklayout.h"
+#include "../internal.h"
#define NFSDBG_FACILITY NFSDBG_PNFS_LD
@@ -990,14 +991,63 @@ bl_clear_layoutdriver(struct nfs_server *server)
return 0;
}
+/* While RFC doesn't limit maximum size of layout, we better limit it ourself. */
+#define PNFSBLK_MAXRSIZE (0x1<<22)
+#define PNFSBLK_MAXWSIZE (0x1<<21)
+static void
+bl_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+{
+ struct inode *ino = pgio->pg_inode;
+ struct pnfs_layout_hdr *lo;
+
+ BUG_ON(pgio->pg_lseg != NULL);
+ spin_lock(&ino->i_lock);
+ lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_KERNEL);
+ if (!lo || test_bit(lo_fail_bit(IOMODE_READ), &lo->plh_flags)) {
+ spin_unlock(&ino->i_lock);
+ nfs_pageio_reset_read_mds(pgio);
+ return;
+ }
+
+ pgio->pg_bsize = PNFSBLK_MAXRSIZE;
+ pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
+ req_offset(req),
+ req->wb_bytes,
+ IOMODE_READ);
+ spin_unlock(&ino->i_lock);
+}
+
+static void
+bl_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
+{
+ struct inode *ino = pgio->pg_inode;
+ struct pnfs_layout_hdr *lo;
+
+ BUG_ON(pgio->pg_lseg != NULL);
+ spin_lock(&ino->i_lock);
+ lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_NOFS);
+ if (!lo || test_bit(lo_fail_bit(IOMODE_RW), &lo->plh_flags)) {
+ spin_unlock(&ino->i_lock);
+ nfs_pageio_reset_write_mds(pgio);
+ return;
+ }
+
+ pgio->pg_bsize = PNFSBLK_MAXWSIZE;
+ pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
+ req_offset(req),
+ req->wb_bytes,
+ IOMODE_RW);
+ spin_unlock(&ino->i_lock);
+}
+
static const struct nfs_pageio_ops bl_pg_read_ops = {
- .pg_init = pnfs_generic_pg_init_read,
+ .pg_init = bl_pg_init_read,
.pg_test = pnfs_generic_pg_test,
.pg_doio = pnfs_generic_pg_readpages,
};
static const struct nfs_pageio_ops bl_pg_write_ops = {
- .pg_init = pnfs_generic_pg_init_write,
+ .pg_init = bl_pg_init_write,
.pg_test = pnfs_generic_pg_test,
.pg_doio = pnfs_generic_pg_writepages,
};
--
1.7.1.262.g5ef3d
On Tue, 2011-11-29 at 17:46 -0800, Boaz Harrosh wrote:
> On 11/29/2011 04:58 PM, Trond Myklebust wrote:
> > On Tue, 2011-11-29 at 16:24 -0800, Boaz Harrosh wrote:
> >> On 11/29/2011 03:30 PM, Trond Myklebust wrote:
> >>> On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
> >>
> >> That I don't understand. What "spec worms that the pNFS layout segments open"
> >> Are you seeing. Because it works pretty simple for me. And I don't see the
> >> big difference for files. One thing I learned for the past is that when you
> >> have concerns I should understand them and start to address them. Because
> >> your insights are usually on the Money. If you are concerned then there is
> >> something I should fix.
> >
> > I'm saying that if I need to manage layouts that deal with >1000 DSes,
> > then I presumably need a strategy for ensuring that I return/forget
> > segments that are no longer needed, and I need a strategy for ensuring
> > that I always hold the segments that I do need; otherwise, I could just
> > ask for a full-file layout and deal with the 1000 DSes (which is what we
> > do today)...
> >
>
> Thanks for asking because now I can answer you and you will find that I'm
> one step a head in some of the issues.
>
> 1. The 1000 DSes problem is separate from the segments problem. The devices
Errr... That was the problem that you used to justify the need for a
full implementation of layout segments in the pNFS files case...
> solution is on the way. The device cache is all but ready to see some
> periodic scan that throws 0 used devices. We never got to it because
> currently every one is testing with up to 10 devices and I'm using upto
> 128 devices which is just fine. The load is marginal so far.
> But I promise you it is right here on my to do list. After some more
> pressed problem.
> Lets say one thing this subsystem is the same regardless of if the
> 1000 devices are refed by 1 segment or by 10 segments. Actually if
> by 10 then I might get rid of some and free devices.
>
> 2. The many segments problem. There are not that many. It's more less
> a segment for every 2GB so an lo_seg struct for so much IO is not
> noticeable.
Where do you get that 2GB number from?
> At the upper bound we do not have any problem because Once the system is
> out of memory it will start to evict inodes. And on evict we just return
> them. Also ROC Servers we forget them on close. So so far all our combined
> testing did not show any real memory pressure caused by that. When shown we
> can start discarding segs in an LRU fashion. There are all the mechanics
> to do that, we only need to see the need.
It's not necessarily that simple: if you are already low on memory, then
LAYOUTGET and GETDEVICE will require you to allocate more memory in
order to get round to cleaning those dirty pages.
There are plenty of situations where the majority of dirty pages belong
to a single file. If that file is one of your 1000 DS-files and it
requires you to allocate 1000 new device table entries...
> 3. The current situation is fine and working and showing great performance
> for objects and blocks. And it is all in the Generic part so it should just
> be the same for files. I do not see any difference.
>
> The only BUG I see is the COMMIT and I think we know how to fix that
I haven't seen any performance numbers for either, so I can't comment.
> > My problem is that the spec certainly doesn't give me any guidance as to
> > such a strategy, and I haven't seen anybody else step up to the plate.
> > In fact, I strongly suspect that such a strategy is going to be very
> > application specific.
> >
>
> You never asked. I'm thinking about these things all the time. Currently
> we are far behind the limits of a running system. I think I'm going to
> get to these limits before any one else.
>
> My strategy is stated above LRU for devices is almost all there ref-counting
> and all only the periodic timer needs to be added.
> LRU for segments is more work, but is doable. But the segments count are
> so low that we will not hit that problem for a long time. Before I ship
> a system that will break that barrier I'll send a fix I promise.
As far as pNFS files is concerned, the memory pressure should be driven
by the number of devices (i.e. DSes).
> > IOW: I don't accept that a layout-segment based solution is useful
> > without some form of strategy for telling me which segments to keep and
> > which to throw out when I start hitting client resource limits.
>
> LRU. Again there are not more than a few segments per inode. It's not
> 1000 like devices.
Again, the problem for files shouldn't be the number of segments, it is
number of devices.
> > I also
> > haven't seen any strategy out there for setting loga_length (as opposed
> > to loga_minlength) in the LAYOUTGET requests: as far as I know that is
> > going to be heavily application-dependent in the 1000-DS world.
> >
>
> Current situation is working for me. But we also are actively working to
> improve it. What we want is that files-LO can enjoy the same privileges that
> objects and blocks already have, in exactly the same, simple stupid but working,
> way.
>
> All your above concerns are true and interesting. I call them a rich man problems.
> But they are not specific to files-LO they are generic to all of us. Current situation
> satisfies us for blocks and objects. The file guys out there are jealous.
I'm not convinced that the problems are the same. objects, and
particularly blocks, appear to treat layout segments as a form of byte
range lock. There is no reason for a pNFS files server to do so.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Fri, 2011-12-02 at 20:52 -0800, Peng Tao wrote:
> Asking for layout in pg_init will always make client ask for only 4KB
> layout in every layoutget. This way, client drops the IO size information
> that is meaningful for MDS in handing out layout.
>
> In stead, if layout is not find in cache, do not send layoutget
> at once. Wait until before issuing IO in pnfs_do_multiple_reads/writes
> because that is where we know the real size of current IO. By telling the
> real IO size to MDS, MDS will have a better chance to give proper layout.
Why can't you just split pnfs_update_layout() into 2 sub-functions
instead of duplicating it in private block code?
Then call layoutget in your pg_doio() callback instead of adding a
redundant pnfs_update_layout to
pnfs_do_multiple_reads/pnfs_do_multiple_writes...
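[Editorial sketch] For illustration only, a tiny userspace model of the split being suggested: one helper that only does the cache lookup, one that issues the LAYOUTGET, and the combined path built from the two, so a pg_doio-style callback can call the second helper directly once the real IO size is known. The names (layout_find_cached, layout_get_from_server, update_layout) are stand-ins for pnfs_find_lseg()/pnfs_update_layout(), not the actual kernel functions.

#include <stdio.h>
#include <stdbool.h>

struct lseg { unsigned long long off, len; };

static struct lseg cache;		/* pretend per-inode layout cache */
static bool cached = false;

/* Sub-function 1: cache lookup only, no server round trip. */
static struct lseg *layout_find_cached(unsigned long long off,
				       unsigned long long len)
{
	if (cached && off >= cache.off && off + len <= cache.off + cache.len)
		return &cache;
	return NULL;
}

/* Sub-function 2: ask the server, with whatever range the caller knows. */
static struct lseg *layout_get_from_server(unsigned long long off,
					   unsigned long long len)
{
	printf("LAYOUTGET offset=%llu length=%llu\n", off, len);
	cache.off = off;
	cache.len = len;
	cached = true;
	return &cache;
}

/* The original combined behaviour is just the two in sequence. */
static struct lseg *update_layout(unsigned long long off, unsigned long long len)
{
	struct lseg *l = layout_find_cached(off, len);
	return l ? l : layout_get_from_server(off, len);
}

int main(void)
{
	/* A pg_doio-style callback can call layout_get_from_server() directly
	 * once it knows the real IO size, instead of asking in pg_init. */
	update_layout(0, 4096);			/* small probe, misses cache  */
	layout_get_from_server(0, 64 << 20);	/* real IO size known at doio */
	update_layout(0, 1 << 20);		/* now served from the cache  */
	return 0;
}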
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/blocklayout/blocklayout.c | 54 ++++++++++++++++++++++++++++++++++++-
> 1 files changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 48cfac3..fd585fe 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -39,6 +39,7 @@
> #include <linux/prefetch.h>
>
> #include "blocklayout.h"
> +#include "../internal.h"
>
> #define NFSDBG_FACILITY NFSDBG_PNFS_LD
>
> @@ -990,14 +991,63 @@ bl_clear_layoutdriver(struct nfs_server *server)
> return 0;
> }
>
> +/* While RFC doesn't limit maximum size of layout, we better limit it ourself. */
> +#define PNFSBLK_MAXRSIZE (0x1<<22)
> +#define PNFSBLK_MAXWSIZE (0x1<<21)
> +static void
> +bl_pg_init_read(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +{
> + struct inode *ino = pgio->pg_inode;
> + struct pnfs_layout_hdr *lo;
> +
> + BUG_ON(pgio->pg_lseg != NULL);
> + spin_lock(&ino->i_lock);
> + lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_KERNEL);
This has never been tested... It contains all sorts of bugs from
recursive attempts to take the ino->i_lock, to sleep-under-spinlock...
> + if (!lo || test_bit(lo_fail_bit(IOMODE_READ), &lo->plh_flags)) {
> + spin_unlock(&ino->i_lock);
> + nfs_pageio_reset_read_mds(pgio);
> + return;
> + }
> +
> + pgio->pg_bsize = PNFSBLK_MAXRSIZE;
> + pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
> + req_offset(req),
> + req->wb_bytes,
> + IOMODE_READ);
> + spin_unlock(&ino->i_lock);
> +}
> +
> +static void
> +bl_pg_init_write(struct nfs_pageio_descriptor *pgio, struct nfs_page *req)
> +{
> + struct inode *ino = pgio->pg_inode;
> + struct pnfs_layout_hdr *lo;
> +
> + BUG_ON(pgio->pg_lseg != NULL);
> + spin_lock(&ino->i_lock);
> + lo = pnfs_find_alloc_layout(ino, req->wb_context, GFP_NOFS);
> + if (!lo || test_bit(lo_fail_bit(IOMODE_RW), &lo->plh_flags)) {
> + spin_unlock(&ino->i_lock);
> + nfs_pageio_reset_write_mds(pgio);
> + return;
> + }
Ditto...
> +
> + pgio->pg_bsize = PNFSBLK_MAXWSIZE;
> + pgio->pg_lseg = pnfs_find_get_layout_locked(ino,
> + req_offset(req),
> + req->wb_bytes,
> + IOMODE_RW);
> + spin_unlock(&ino->i_lock);
> +}
> +
> static const struct nfs_pageio_ops bl_pg_read_ops = {
> - .pg_init = pnfs_generic_pg_init_read,
> + .pg_init = bl_pg_init_read,
> .pg_test = pnfs_generic_pg_test,
> .pg_doio = pnfs_generic_pg_readpages,
> };
>
> static const struct nfs_pageio_ops bl_pg_write_ops = {
> - .pg_init = pnfs_generic_pg_init_write,
> + .pg_init = bl_pg_init_write,
> .pg_test = pnfs_generic_pg_test,
> .pg_doio = pnfs_generic_pg_writepages,
> };
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
It tries to find the lseg in the local cache but does not retrieve the layout from the server.
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 25 +++++++++++++++++++++++++
fs/nfs/pnfs.h | 5 +++++
2 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 3be29c7..734e670 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -933,6 +933,31 @@ pnfs_find_lseg(struct pnfs_layout_hdr *lo,
}
/*
+ * Find and reference lseg with ino->i_lock held.
+ */
+struct pnfs_layout_segment *
+pnfs_find_get_layout_locked(struct inode *ino,
+ loff_t pos,
+ u64 count,
+ enum pnfs_iomode iomode)
+{
+ struct pnfs_layout_segment *lseg = NULL;
+ struct pnfs_layout_range range = {
+ .iomode = iomode,
+ .offset = pos,
+ .length = count,
+ };
+
+ if (NFS_I(ino)->layout == NULL)
+ goto out;
+
+ lseg = pnfs_find_lseg(NFS_I(ino)->layout, &range);
+out:
+ return lseg;
+}
+EXPORT_SYMBOL_GPL(pnfs_find_get_layout_locked);
+
+/*
* Layout segment is retreived from the server if not cached.
* The appropriate layout segment is referenced and returned to the caller.
*/
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 9614ac9..0c55fc1 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -213,6 +213,11 @@ struct pnfs_layout_hdr *
pnfs_find_alloc_layout(struct inode *ino,
struct nfs_open_context *ctx,
gfp_t gfp_flags);
+struct pnfs_layout_segment *
+pnfs_find_get_layout_locked(struct inode *ino,
+ loff_t pos,
+ u64 count,
+ enum pnfs_iomode iomode);
void nfs4_deviceid_mark_client_invalid(struct nfs_client *clp);
--
1.7.1.262.g5ef3d
Trond Myklebust <[email protected]> wrote on 11/29/2011 04:37:05
PM:
>
> Peng Tao, bhalevy, Boaz Harrosh, Garth Gibson, Fred Isaman, linux-
> nfs, Matt Benjamin
>
> On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
> > You ignored my main point, I was talking about the server side, my point
> > was that there is nothing to build on on the server side since the pNFS
> > Linux server is not happening.
> > Marc.
>
> Sorry. I misunderstood your concern. As far as I know, the main problem
> there is also one of investment: nobody has stepped up to help Bruce
> write a pNFS server.
>
> I'm less worried about this now than I was earlier, because other open
> source efforts are gaining traction (see Ganesha - which is being
> sponsored by IBM, and projects such as Tigran's java based pNFS server).
> The other point is that we've developed client test-rigs that don't
> depend on the availability of a Linux server (newpynfs and the pynfs
> based proxy).
You got it backward, Ganesha is getting traction precisely because the
Linux kernel server is not happening :)
Sorry, I did not mean to change the topic of this thread, go back to
addressing Boaz's concern.
Marc.
>
> Cheers
> Trond
>
> > From:
> > Trond Myklebust <[email protected]>
> > To:
> > Marc Eshel <[email protected]>
> > Cc:
> > Peng Tao <[email protected]>, [email protected], Boaz Harrosh
> > <[email protected]>, Garth Gibson <[email protected]>, Fred Isaman
> > <[email protected]>, [email protected], Matt Benjamin
> > <[email protected]>
> > Date:
> > 11/29/2011 04:08 PM
> > Subject:
> > Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
> >
> >
> >
> > On Tue, 2011-11-29 at 15:49 -0800, Marc Eshel wrote:
> > > The only 1 cent that I can add to this argument is that I was led to
> > > believe by you and others that the Linux kernel doesn't add functionality on the
> > > client side that is not supported on the server side. Last time I made
> > > this point I was told that it is ok if they are off by a version or two.
> > > You made it clear that this is no longer true and that the Linux client
> > > and server are now independent of each other. I spent time working on the
> > > server side believing that the client and server will progress more or
> > > less together, disappointed to find out that it is not the case any more.
> > > Marc.
> >
> > I don't know how to manage layout segments in a way that meets the pNFS
> > goals of scalability and performance, and neither you nor the pNFS spec
> > have told me how to do this.
> >
> > If IBM wants a client that implements layout segments, then it is _your_
> > responsibility to:
> > A. Convince me that it is an implementable part of the spec.
> > B. Provide me with an implementation that deals with all the
> > concerns that I have.
> > IOW: nobody has ever promised IBM that the community would do all your
> > client work for you. EMC, NetApp and Panasas have implemented (most of)
> > the bits that they cared about; any bits that they haven't implemented
> > and that you care about are up to you.
> > and that you care about are up to you.
> >
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>
On Wed, 2011-11-30 at 01:25 +0800, Peng Tao wrote:
> On Wed, Nov 30, 2011 at 12:40 AM, Trond Myklebust
> <[email protected]> wrote:
> > On Fri, 2011-12-02 at 20:52 -0800, Peng Tao wrote:
> >> Asking for layout in pg_init will always make client ask for only 4KB
> >> layout in every layoutget. This way, client drops the IO size information
> >> that is meaningful for MDS in handing out layout.
> >>
> >> In stead, if layout is not find in cache, do not send layoutget
> >> at once. Wait until before issuing IO in pnfs_do_multiple_reads/writes
> >> because that is where we know the real size of current IO. By telling the
> >> real IO size to MDS, MDS will have a better chance to give proper layout.
> >
> > Why can't you just split pnfs_update_layout() into 2 sub-functions
> > instead of duplicating it in private block code?
> Because I wanted to differentiate between no layout header and no
> cached lseg, where the pnfs_update_layout() interface is not enough to
> tell the difference. Of course I can put these all into generic layer.
> I will update the patchset to do it.
>
> >
> > Then call layoutget in your pg_doio() callback instead of adding a
> > redundant pnfs_update_layout to
> > pnfs_do_multiple_reads/pnfs_do_multiple_writes...
> I have considered it before but using private pg_doio() means we will
> have as much duplication of pnfs_generic_pg_read/writepages.
Why? All you need to do is send the layoutget, and then call the
existing pnfs_generic_pg_read/writepages?
The difference here is that you're adding that step into
pnfs_generic_pg_read/writepages in patch 3/4. Basically you are adding
block-specific code into an otherwise generic function instead of doing
it cleanly in the block-specific callbacks.
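[Editorial sketch] A rough model of the shape being asked for, assuming a simplified pageio descriptor: the block driver keeps the generic writepages path untouched and issues its LAYOUTGET for the full coalesced range inside its own pg_doio-style callback. bl_pg_writepages and the descriptor fields are hypothetical, modelled loosely on the ops tables quoted in this thread, not tested kernel code.

#include <stdio.h>

struct pageio_desc {
	unsigned long long off, count;	/* coalesced IO range */
	void *lseg;			/* layout segment, if any */
};

/* Stands in for the existing generic writepages path, left unmodified. */
static void generic_pg_writepages(struct pageio_desc *d)
{
	printf("generic writepages: [%llu, %llu), lseg=%p\n",
	       d->off, d->off + d->count, d->lseg);
}

static void *layoutget(unsigned long long off, unsigned long long count)
{
	printf("LAYOUTGET for real IO size: offset=%llu length=%llu\n",
	       off, count);
	static int seg;
	return &seg;
}

/* Block-specific pg_doio: grab the layout for the full coalesced range,
 * then fall through to the untouched generic path. */
static void bl_pg_writepages(struct pageio_desc *d)
{
	if (!d->lseg)
		d->lseg = layoutget(d->off, d->count);
	generic_pg_writepages(d);
}

int main(void)
{
	struct pageio_desc d = { .off = 0, .count = 8 << 20, .lseg = NULL };
	bl_pg_writepages(&d);
	return 0;
}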
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 2011-11-30 07:05, [email protected] wrote:
>> -----Original Message-----
>> From: Boaz Harrosh [mailto:[email protected]]
>> Sent: Wednesday, November 30, 2011 11:51 AM
>> To: Peng, Tao
>> Cc: [email protected]; [email protected]; [email protected]; [email protected]
>> Subject: Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
>>
>> On 11/29/2011 07:16 PM, [email protected] wrote:
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:[email protected]] On Behalf Of Boaz
>>>> Harrosh
>>>> Sent: Wednesday, November 30, 2011 5:34 AM
>>>> To: Peng Tao
>>>> Cc: [email protected]; [email protected]; [email protected]
>>>> Subject: Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
>>>>
>>>> On 12/02/2011 08:52 PM, Peng Tao wrote:
>>>>> Issuing layoutget at .pg_init will drop the IO size information and ask for 4KB
>>>>> layout every time. However, the IO size information is very valuable for MDS to
>>>>> determine how much layout it should return to client.
>>>>>
>>>>> The patchset try to allow LD not to send layoutget at .pg_init but instead at
>>>>> pnfs_do_multiple_writes. So that real IO size is preserved and sent to MDS.
>>>>>
>>>>> Tests against a server that does not aggressively pre-allocate layout, shows
>>>>> that the IO size informantion is really useful to block layout MDS.
>>>>>
>>>>> The generic pnfs layer changes are trival to file layout and object as long as
>>>>> they still send layoutget at .pg_init.
>>>>>
>>>>
>>>> I have a better solution for your problem. Which is a much smaller a change and
>>>> I think gives you much better heuristics.
>>>>
>>>> Keep the layout_get exactly where it is, but instead of sending PAGE_SIZE send
>>>> the amount of dirty pages you have.
>>>>
>>>> If it is a linear write you will be exact on the money with a single lo_get. If
>>>> it is an heavy random write then you might need more lo_gets and you might be getting
>>>> some unused segments. But heavy random write is rare and slow anyway. As a first
>>>> approximation its fine. (We can later fix that as well)
>>>
>>> I would say no to the above... For objects/files MDS, it may not hurt
>>> much to allocate wasting layout. But for blocklayout server, each
>>> layout allocation consumes much more resource than just giving out
>>> stripping information like objects/files.
>>
>> That's fine, for the linear IO like iozone below my way is just the same
>> as yours. For the random IO I'm not sure how much better will your solution
>> be. Not by much.
> As I said, for random IO, there will be much disk space wasting on blocklayout server in your solution. That's why I don't agree with it. Besides, in some cases, server may be put in a hard position to determine if the IO is really linear or in fact random in your solution.
>
>>
>> I want a solution for objects as well. But I cannot use yours because I need
>> a layout before the final request consolidation. Solve my problem too.
>>
> I used to look at objects at some time, and as I remember, it need max io size in each lseg to finish .pg_test. Is this the reason you need a layout before the final request consolidation? Does the value vary in different lseg?
>
>>> So helping MDS to do the
>>> correct decision is the right thing for client to do.
>>
>> I agree. All I'm saying is that there is available information at the time
>> of .pg_init to send that number just fine. Have you looked? it's all there
>> NFS core can tell you how many pages have passed ->write_pages.
>>
> It only tells a fake IO size for the number of dirty pages. No one can promise these pages are all continuous. Instead, if we can give a real IO size, why refuse to do it?
>
I agree. The client need not blindly extend loga_length.
Simply ask for what you need and let the server optimize the result.
Benny
>>>
>>>>
>>>> The .pg_init is done after .write_pages call from VFS and all the to-be-written
>>>> pages are already staged to be written. So there should be a way to easily extract
>>>> that information.
>>>>
>>>>> iozone cmd:
>>>>> ./iozone -r 1m -s 4G -w -W -c -t 10 -i 0 -F /mnt/iozone.data.1 /mnt/iozone.data.2
>> /mnt/iozone.data.3
>>>> /mnt/iozone.data.4 /mnt/iozone.data.5 /mnt/iozone.data.6 /mnt/iozone.data.7 /mnt/iozone.data.8
>>>> /mnt/iozone.data.9 /mnt/iozone.data.10
>>>>>
>>>>> Befor patch: around 12MB/s throughput
>>>>> After patch: around 72MB/s throughput
>>>>>
>>>>
>>>> Yes Yes that stupid Brain dead Server is no indication for anything. The server
>>>> should know best about optimal sizes and layouts. Please don't give me that stuff
>>>> again.
>>>>
>>> Actually the server is already doing layout pre-allocation. It is
>>> just that it doesn't know what client really wants so cannot do it
>>> too aggressively. That's why I wanted to make client to send the REAL
>>> IO size information to server. From performance perspective, dropping
>>> IO size information is always a BAD THING(TM) to do.
>>
>> I totally agree. I want it too. There is a way to do it in pg_init time
>> all the information is there it only needs to be passed to layout_get.
>>
> This "all the information is there" is likely to be false, unless you only deal with sequential IO...
>
>>>
>>>> BTW don't limit the lo_segment size by the max_io_size. This is why you
>>>> have .bg_test to signal when IO is maxed out.
>>>>
>>> Actually lo_segment size is never limited by max_io_size. Server is
>>> always entitled to send larger layout than client asks from.
>>
>> You miss my point. In your last patch you have
>>
>> +/* While RFC doesn't limit maximum size of layout, we better limit it ourself. */
>> +#define PNFSBLK_MAXRSIZE (0x1<<22)
>> +#define PNFSBLK_MAXWSIZE (0x1<<21)
>>
>> I don't know what these number mean but they kind of look like IO limits
>> and not segment limits. If I'm wrong then sorry. What are these numbers?
>>
> Yes, these are io size limit. I should just remove the comments that are totally misleading.
>
>> If a client has 1G of dirty pages to write why not get the full layout
>> at once. Where does the 4M limit comes from?
>>
> Please note that block layout server cannot just give full file stripping information. When we ask for 1GB layout, most times we get much less anyway. And the value of 2MB comes from our experience with MPFS, to allow the balance between server pressure and client performance.
>
> Also, currently we retry MDS on a per nfs_read/write_data basis. It is much easier to handle 2MB rather than 1GB dirty pages. I notice that it may not be an issue for objects as you have max IO size limit on every lseg.
>
>>>
>>>> - The read segments should be as big as possible (i_size long)
>>>> - The Write segments should ideally be as big as the Application
>>>> wants to write to. (Amount of dirty pages at time of nfs-write-out
>>>> is a very good first approximation).
>>>>
>>>> So I guess it is: I hate these patches, to much mess, too little goodness.
>>> I'm afraid I can't agree with you...
>>>
>>
>> Sure you do. You did the hard work and now I'm telling you you need to do
>> more work. I'm sorry for that. But I want a solution for me and I think
>> there is a simple solution that will satisfy both of our needs.
>>
> Sorry but I don't think your solution is good enough to address blocklayout's concerns. It would be great if we could use the same solution. But when we do need to differ, I think it is perfectly reasonable to let blocklayout and object layout have different strategies for layoutget, based on the fact that our servers have different behavior on layout allocation. And allowing this kind of difference is exactly what struct nfs_pageio_ops is there for.
>
> What do you think?
>
> Thanks,
> Tao
>> Sorry for that. If I had time I would do it. Only I have harder real BUGS
>> to fix on my plate.
>>
>> If you could look into it It will be very nice. And thank you for working
>> on this so far. Only that current solution is not optimal and I will need
>> to continue on it later, if left as is.
>>
>>> Thanks,
>>> Tao
>>>
>>
>> Thanks
>
I think it's reasonable that we propose a strategy or strategies for consideration; let's do so and meet back here.
Thanks,
Matt
----- "Trond Myklebust" <[email protected]> wrote:
> On Tue, 2011-11-29 at 19:48 -0500, Matt W. Benjamin wrote:
> > Let me clarify: there are file-based servers, our Ceph-on-Ganesha
> > server is one, whose file allocation is not satisfied by whole-file
> > layouts. I would think that demonstrating this would be sufficient to
> > get the Linux client to support appropriate segment
> > management, at any rate if someone is willing to write and support
> > the required code, or already has. One of those alternatives is
> > certainly the case. By the way, we wrote generic pNFS and pNFS files
> > support for Ganesha and, with a big dose of help from Panasas, are
> > taking it to merge.
>
> I really want more than that. Please see the reply that I just sent
> to
> Boaz: I need a client strategy for managing partial layout segments
> in
> the case where holding a whole-file layout is not acceptable.
> Otherwise,
> what we have now should be sufficient...
>
> Trond
>
> > Matt
> >
> > ----- "Matt W. Benjamin" <[email protected]> wrote:
> >
> > > That would be pretty disappointing. However, based on previous
> > > interactions, my belief would be, the
> > > Linux client will do what can be shown empirically to work better,
> or
> > > more correctly.
> > >
> > > Matt
> > >
> > > ----- "Trond Myklebust" <[email protected]> wrote:
> > >
> > > > On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> > > > > On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> > > > > >> Also Files when they will support segments and servers
> that
> > > > request segments,
> > > > > >> like the CEPH server, will very much enjoy the above,
> .i.e:
> > > Tell
> > > > me the amount
> > > > > >> you know you want to write.
> > > > > >
> > > > > > Why would we want to add segment support to the pNFS files
> > > > client???
> > > > > > Segments are a nuisance that multiply the amount of
> unnecessary
> > > > chitchat
> > > > > > between the client and the MDS without providing any
> tangible
> > > > > > benefits...
> > > > > >
> > > > >
> > > > > Your kidding right?
> > > > >
> > > > > One: it is mandated by the Standard, This is not an option. So
> a
> > > > perfectly
> > > > > Standard complaint server is not Supported by Linux
> because
> > > we
> > > > don't see
> > > > > the point.
> > > >
> > > > Bollocks.. Nothing is "mandated by the Standard". If the server
> > > > doesn't
> > > > give us a full layout, then we fall back to write through MDS.
> Why
> > > > dick
> > > > around with crap that SLOWS YOU DOWN.
> > > >
> > > > > Two: There are already file-layout servers out there
> (multiple)
> > > > which are
> > > > > waiting for the Linux files-layout segment support,
> because
> > > the
> > > > underline
> > > > > FS requires Segments and now they do not work with the
> Linux
> > > > client. These
> > > > > are CEPH and GPFS and more.
> > > >
> > > > Then they will have a _long_ wait....
> > > >
> > > > Trond
> > > >
> > > > --
> > > > Trond Myklebust
> > > > Linux NFS client maintainer
> > > >
> > > > NetApp
> > > > [email protected]
> > > > http://www.netapp.com
> > >
> > > --
> > >
> > > Matt Benjamin
> > >
> > > The Linux Box
> > > 206 South Fifth Ave. Suite 150
> > > Ann Arbor, MI 48104
> > >
> > > http://linuxbox.com
> > >
> > > tel. 734-761-4689
> > > fax. 734-769-8938
> > > cel. 734-216-5309
> >
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> On 11/29/2011 01:57 PM, Trond Myklebust wrote:
> >> Also Files when they will support segments and servers that request segments,
> >> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
> >> you know you want to write.
> >
> > Why would we want to add segment support to the pNFS files client???
> > Segments are a nuisance that multiply the amount of unnecessary chitchat
> > between the client and the MDS without providing any tangible
> > benefits...
> >
>
> Your kidding right?
>
> One: it is mandated by the Standard; this is not an option. So a perfectly
> Standard-compliant server is not supported by Linux because we don't see
> the point.
Bollocks.. Nothing is "mandated by the Standard". If the server doesn't
give us a full layout, then we fall back to write through MDS. Why dick
around with crap that SLOWS YOU DOWN.
> Two: There are already file-layout servers out there (multiple) which are
> waiting for Linux files-layout segment support, because the underlying
> FS requires segments, and right now they do not work with the Linux client. These
> are CEPH and GPFS and more.
Then they will have a _long_ wait....
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 11/29/2011 01:57 PM, Trond Myklebust wrote:
>> Also Files when they will support segments and servers that request segments,
>> like the CEPH server, will very much enjoy the above, .i.e: Tell me the amount
>> you know you want to write.
>
> Why would we want to add segment support to the pNFS files client???
> Segments are a nuisance that multiply the amount of unnecessary chitchat
> between the client and the MDS without providing any tangible
> benefits...
>
You're kidding, right?
One: it is mandated by the Standard; this is not an option. So a perfectly
Standard-compliant server is not supported by Linux because we don't see
the point.
Two: There are already file-layout servers out there (multiple) which are
waiting for Linux files-layout segment support, because the underlying
FS requires segments, and right now they do not work with the Linux client. These
are CEPH and GPFS and more.
The reason NetApp doesn't see a point is not a factor. I have not seen a NetApp
server in a top-100 HPC cluster either.
If the server hates the "unnecessary chitchat between the client and the MDS"
it can serve whole-file layouts and that is that. But some servers don't have
that privilege. In really big storage clusters with thousands of DSs, the data
layout might only be properly described by segments.
Let the fight begin. All the segmented file-layout servers out there. Now is
the time to take a stand.
Heart
So that the layout driver can find or allocate the layout header when there is none yet.
Signed-off-by: Peng Tao <[email protected]>
---
fs/nfs/pnfs.c | 3 ++-
fs/nfs/pnfs.h | 4 ++++
2 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index baf7353..3be29c7 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -848,7 +848,7 @@ alloc_init_layout_hdr(struct inode *ino,
return lo;
}
-static struct pnfs_layout_hdr *
+struct pnfs_layout_hdr *
pnfs_find_alloc_layout(struct inode *ino,
struct nfs_open_context *ctx,
gfp_t gfp_flags)
@@ -875,6 +875,7 @@ pnfs_find_alloc_layout(struct inode *ino,
pnfs_free_layout_hdr(new);
return nfsi->layout;
}
+EXPORT_SYMBOL_GPL(pnfs_find_alloc_layout);
/*
* iomode matching rules:
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 1509530..9614ac9 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -209,6 +209,10 @@ struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
u64 count,
enum pnfs_iomode iomode,
gfp_t gfp_flags);
+struct pnfs_layout_hdr *
+pnfs_find_alloc_layout(struct inode *ino,
+ struct nfs_open_context *ctx,
+ gfp_t gfp_flags);
void nfs4_deviceid_mark_client_invalid(struct nfs_client *clp);
--
1.7.1.262.g5ef3d
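For illustration, a minimal sketch of how a layout driver might use the newly exported helper (hypothetical caller, not part of the patch; pnfs_find_alloc_layout() expects i_lock to be held, and taking a reference through get_layout_hdr() is assumed to be visible to the LD):

/*
 * Hypothetical LD-side caller: find or allocate the layout header for
 * an inode before any lseg exists.  pnfs_find_alloc_layout() must be
 * called with i_lock held; it may drop and retake the lock internally
 * while allocating.
 */
static struct pnfs_layout_hdr *
ld_find_alloc_layout(struct inode *inode, struct nfs_open_context *ctx)
{
        struct pnfs_layout_hdr *lo;

        spin_lock(&inode->i_lock);
        lo = pnfs_find_alloc_layout(inode, ctx, GFP_NOFS);
        if (lo)
                get_layout_hdr(lo);     /* hold a reference; assumed visible to the LD */
        spin_unlock(&inode->i_lock);
        return lo;
}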
On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
> The reason NetApp doesn't see a point is not a factor. I have not seen a NetApp
> server in a top-100 HPC cluster either.
BTW: Stop adding this "NetApp" insinuation bullshit...
I _strongly_ resent the fact that you appear to think I'm incapable of
thinking for myself...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 2011-12-03 06:52, Peng Tao wrote:
> This gives the LD the option not to ask for a layout in pg_init.
>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/nfs/pnfs.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 46 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 734e670..c8dc0b1 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1254,6 +1254,7 @@ pnfs_do_multiple_writes(struct nfs_pageio_descriptor *desc, struct list_head *he
> struct nfs_write_data *data;
> const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
> struct pnfs_layout_segment *lseg = desc->pg_lseg;
> + const bool has_lseg = !!lseg;
nit: "has_lseg = (lseg != NULL)" would be more straight forward IMO
>
> desc->pg_lseg = NULL;
> while (!list_empty(head)) {
> @@ -1262,7 +1263,29 @@ pnfs_do_multiple_writes(struct nfs_pageio_descriptor *desc, struct list_head *he
> data = list_entry(head->next, struct nfs_write_data, list);
> list_del_init(&data->list);
>
> + if (!has_lseg) {
> + struct nfs_page *req = nfs_list_entry(data->pages.next);
> + __u64 length = data->npages << PAGE_CACHE_SHIFT;
> +
> + lseg = pnfs_update_layout(desc->pg_inode,
> + req->wb_context,
> + req_offset(req),
> + length,
> + IOMODE_RW,
> + GFP_NOFS);
> + if (!lseg || length > (lseg->pls_range.length)) {
I'm concerned about the 'length' part of this condition.
pnfs_try_to_write_data should handle short writes/reads and
we should be able to iterate through the I/O using different
layout segments.
> + put_lseg(lseg);
> + lseg = NULL;
> + pnfs_write_through_mds(desc,data);
> + continue;
> + }
> + }
> +
> trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
> + if (!has_lseg) {
> + put_lseg(lseg);
> + lseg = NULL;
> + }
We had an implementation in the past that saved the most recent lseg in 'desc'
so it could be used for the remaining requests. Once exhausted, you can
look for a new one.
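Something along these lines, as a rough untested sketch against this patch (lseg_covers_range() is an assumed helper checking offset/length against lseg->pls_range; desc->pg_lseg is still NULLed before the loop as in this patch, and the cached lseg is put once after the loop instead of per iteration):

	if (!has_lseg) {
		struct nfs_page *req = nfs_list_entry(data->pages.next);
		__u64 length = data->npages << PAGE_CACHE_SHIFT;

		/* reuse the cached lseg while it still covers this request */
		if (desc->pg_lseg &&
		    !lseg_covers_range(desc->pg_lseg, req_offset(req), length)) {
			put_lseg(desc->pg_lseg);
			desc->pg_lseg = NULL;
		}
		if (!desc->pg_lseg)
			desc->pg_lseg = pnfs_update_layout(desc->pg_inode,
							   req->wb_context,
							   req_offset(req),
							   length,
							   IOMODE_RW,
							   GFP_NOFS);
		lseg = desc->pg_lseg;
		if (!lseg) {
			pnfs_write_through_mds(desc, data);
			continue;
		}
	}
	/* ... and put_lseg(desc->pg_lseg) once, after the while loop */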
> if (trypnfs == PNFS_NOT_ATTEMPTED)
> pnfs_write_through_mds(desc, data);
> }
> @@ -1350,6 +1373,7 @@ pnfs_do_multiple_reads(struct nfs_pageio_descriptor *desc, struct list_head *hea
> struct nfs_read_data *data;
> const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
> struct pnfs_layout_segment *lseg = desc->pg_lseg;
> + const bool has_lseg = !!lseg;
ditto
>
> desc->pg_lseg = NULL;
> while (!list_empty(head)) {
> @@ -1358,7 +1382,29 @@ pnfs_do_multiple_reads(struct nfs_pageio_descriptor *desc, struct list_head *hea
> data = list_entry(head->next, struct nfs_read_data, list);
> list_del_init(&data->list);
>
> + if (!has_lseg) {
> + struct nfs_page *req = nfs_list_entry(data->pages.next);
> + __u64 length = data->npages << PAGE_CACHE_SHIFT;
> +
> + lseg = pnfs_update_layout(desc->pg_inode,
> + req->wb_context,
> + req_offset(req),
> + length,
> + IOMODE_READ,
> + GFP_KERNEL);
> + if (!lseg || length > lseg->pls_range.length) {
> + put_lseg(lseg);
> + lseg = NULL;
> + pnfs_read_through_mds(desc, data);
> + continue;
> + }
> + }
> +
> trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
> + if (!has_lseg) {
> + put_lseg(lseg);
> + lseg = NULL;
> + }
ditto
Benny
> if (trypnfs == PNFS_NOT_ATTEMPTED)
> pnfs_read_through_mds(desc, data);
> }
On 11/29/2011 01:34 PM, Boaz Harrosh wrote:
> But just do the above and you'll see that it is perfect.
>
> BTW don't limit the lo_segment size by the max_io_size. This is why you
> have .pg_test to signal when IO is maxed out.
>
> - The read segments should be as big as possible (i_size long)
> - The Write segments should ideally be as big as the Application
> wants to write to. (Amount of dirty pages at time of nfs-write-out
> is a very good first approximation).
>
> So I guess it is: I hate these patches, to much mess, too little goodness.
>
> Thank
> Boaz
>
Oh, and one more thing.
Also Files, when it supports segments and there are servers that request segments,
like the CEPH server, will very much enjoy the above, i.e.: tell me the amount
you know you want to write.
And surely Objects will enjoy that tremendously. Is it a 17-byte text file, or
the beginning of a big video file?
But the problem with Files and Objects is that they need a layout first before
they can do the right thing in .pg_test and say whether a request belongs to this IO or
to the next. (For Files it is stripe_unit, maxed and aligned; for Objects the raid-group boundary.)
So your solution is not good for Files and Objects; my way solves them too.
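To make that concrete, such a .pg_test might look roughly like this (sketch only, not the in-tree Objects/Files code; the bool-returning pg_test of this era is assumed, and example_io_boundary() is a made-up helper that would pull the stripe-unit or raid-group size out of the lseg):

/*
 * Sketch: without an lseg there is no boundary to test against, which
 * is why Files/Objects want the layout before request consolidation.
 */
static bool example_pg_test(struct nfs_pageio_descriptor *pgio,
			    struct nfs_page *prev, struct nfs_page *req)
{
	u32 boundary;
	u64 stripe_end;

	if (!pnfs_generic_pg_test(pgio, prev, req))
		return false;
	if (!pgio->pg_lseg)
		return false;	/* no layout yet: nothing to decide against */

	boundary = example_io_boundary(pgio->pg_lseg);	/* assumed helper */
	/* div_u64() from <linux/math64.h>; end of the boundary 'prev' sits in */
	stripe_end = (div_u64(req_offset(prev), boundary) + 1) * (u64)boundary;
	/* only coalesce req if it stays inside prev's boundary */
	return req_offset(req) + req->wb_bytes <= stripe_end;
}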
Thanks
Boaz
On 12/02/2011 08:52 PM, Peng Tao wrote:
> Issuing layoutget at .pg_init will drop the IO size information and ask for 4KB
> layout every time. However, the IO size information is very valuable for MDS to
> determine how much layout it should return to client.
>
> The patchset tries to allow the LD not to send layoutget at .pg_init but instead at
> pnfs_do_multiple_writes, so that the real IO size is preserved and sent to the MDS.
>
> Tests against a server that does not aggressively pre-allocate layout show
> that the IO size information is really useful to a block layout MDS.
>
> The generic pnfs layer changes are trivial to the file and object layouts as long as
> they still send layoutget at .pg_init.
>
I have a better solution for your problem, which is a much smaller change and
I think gives you much better heuristics.
Keep the layout_get exactly where it is, but instead of sending PAGE_SIZE send
the amount of dirty pages you have.
If it is a linear write you will be exactly on the money with a single lo_get. If
it is a heavy random write then you might need more lo_gets and you might be getting
some unused segments. But heavy random write is rare and slow anyway. As a first
approximation it's fine. (We can later fix that as well.)
The .pg_init is done after the .write_pages call from the VFS and all the to-be-written
pages are already staged to be written. So there should be a way to easily extract
that information.
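As an untested sketch of what that could look like, not in-tree code (NFS_I(inode)->npages is used here as the count of pages that already went through ->write_pages; it over-counts for sparse random dirtying, which is the approximation noted above):

static void example_pg_init_write(struct nfs_pageio_descriptor *pgio,
				  struct nfs_page *req)
{
	struct inode *inode = pgio->pg_inode;
	/* size the LAYOUTGET by the outstanding write requests, not one page */
	u64 wb_size = (u64)NFS_I(inode)->npages << PAGE_CACHE_SHIFT;

	if (wb_size < PAGE_CACHE_SIZE)
		wb_size = PAGE_CACHE_SIZE;

	pgio->pg_lseg = pnfs_update_layout(inode,
					   req->wb_context,
					   req_offset(req),
					   wb_size,
					   IOMODE_RW,
					   GFP_NOFS);
	/* no layout -> fall back to writing through the MDS, as today */
	if (pgio->pg_lseg == NULL)
		nfs_pageio_reset_write_mds(pgio);
}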
> iozone cmd:
> ./iozone -r 1m -s 4G -w -W -c -t 10 -i 0 -F /mnt/iozone.data.1 /mnt/iozone.data.2 /mnt/iozone.data.3 /mnt/iozone.data.4 /mnt/iozone.data.5 /mnt/iozone.data.6 /mnt/iozone.data.7 /mnt/iozone.data.8 /mnt/iozone.data.9 /mnt/iozone.data.10
>
> Before patch: around 12MB/s throughput
> After patch: around 72MB/s throughput
>
Yes yes, that stupid brain-dead server is no indication of anything. The server
should know best about optimal sizes and layouts. Please don't give me that stuff
again.
But just do the above and you'll see that it is perfect.
BTW don't limit the lo_segment size by the max_io_size. This is why you
have .pg_test to signal when IO is maxed out.
- The read segments should be as big as possible (i_size long)
- The Write segments should ideally be as big as the Application
wants to write to. (Amount of dirty pages at time of nfs-write-out
is a very good first approximation).
So I guess it is: I hate these patches, too much mess, too little goodness.
Thanks
Boaz
> Peng Tao (4):
> nfsv41: export pnfs_find_alloc_layout
> nfsv41: add and export pnfs_find_get_layout_locked
> nfsv41: get lseg before issue LD IO if pgio doesn't carry lseg
> pnfsblock: do ask for layout in pg_init
>
> fs/nfs/blocklayout/blocklayout.c | 54 ++++++++++++++++++++++++++-
> fs/nfs/pnfs.c | 74 +++++++++++++++++++++++++++++++++++++-
> fs/nfs/pnfs.h | 9 +++++
> 3 files changed, 134 insertions(+), 3 deletions(-)
>
On 11/29/2011 07:16 PM, [email protected] wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Boaz
>> Harrosh
>> Sent: Wednesday, November 30, 2011 5:34 AM
>> To: Peng Tao
>> Cc: [email protected]; [email protected]; [email protected]
>> Subject: Re: [PATCH 0/4] nfs41: allow layoutget at pnfs_do_multiple_writes
>>
>> On 12/02/2011 08:52 PM, Peng Tao wrote:
>>> Issuing layoutget at .pg_init will drop the IO size information and ask for 4KB
>>> layout every time. However, the IO size information is very valuable for MDS to
>>> determine how much layout it should return to client.
>>>
>>> The patchset try to allow LD not to send layoutget at .pg_init but instead at
>>> pnfs_do_multiple_writes. So that real IO size is preserved and sent to MDS.
>>>
>>> Tests against a server that does not aggressively pre-allocate layout, shows
>>> that the IO size informantion is really useful to block layout MDS.
>>>
>>> The generic pnfs layer changes are trival to file layout and object as long as
>>> they still send layoutget at .pg_init.
>>>
>>
>> I have a better solution for your problem. Which is a much smaller a change and
>> I think gives you much better heuristics.
>>
>> Keep the layout_get exactly where it is, but instead of sending PAGE_SIZE send
>> the amount of dirty pages you have.
>>
>> If it is a linear write you will be exact on the money with a single lo_get. If
>> it is an heavy random write then you might need more lo_gets and you might be getting
>> some unused segments. But heavy random write is rare and slow anyway. As a first
>> approximation its fine. (We can later fix that as well)
>
> I would say no to the above... For an objects/files MDS, it may not hurt
> much to allocate wasted layout. But for a blocklayout server, each
> layout allocation consumes much more resource than just giving out
> striping information like objects/files.
That's fine; for linear IO like the iozone run below my way is just the same
as yours. For random IO I'm not sure how much better your solution will
be. Not by much.
I want a solution for objects as well. But I cannot use yours because I need
a layout before the final request consolidation. Solve my problem too.
> So helping MDS to do the
> correct decision is the right thing for client to do.
I agree. All I'm saying is that the information is available at the time
of .pg_init to send that number just fine. Have you looked? It's all there;
the NFS core can tell you how many pages have passed ->write_pages.
>
>>
>> The .pg_init is done after .write_pages call from VFS and all the to-be-written
>> pages are already staged to be written. So there should be a way to easily extract
>> that information.
>>
>>> iozone cmd:
>>> ./iozone -r 1m -s 4G -w -W -c -t 10 -i 0 -F /mnt/iozone.data.1 /mnt/iozone.data.2 /mnt/iozone.data.3
>> /mnt/iozone.data.4 /mnt/iozone.data.5 /mnt/iozone.data.6 /mnt/iozone.data.7 /mnt/iozone.data.8
>> /mnt/iozone.data.9 /mnt/iozone.data.10
>>>
>>> Befor patch: around 12MB/s throughput
>>> After patch: around 72MB/s throughput
>>>
>>
>> Yes Yes that stupid Brain dead Server is no indication for anything. The server
>> should know best about optimal sizes and layouts. Please don't give me that stuff
>> again.
>>
> Actually the server is already doing layout pre-allocation. It is
> just that it doesn't know what the client really wants, so it cannot do it
> too aggressively. That's why I wanted to make the client send the REAL
> IO size information to the server. From a performance perspective, dropping
> IO size information is always a BAD THING(TM) to do.
I totally agree. I want it too. There is a way to do it at pg_init time;
all the information is there, it only needs to be passed to layout_get.
>
>> BTW don't limit the lo_segment size by the max_io_size. This is why you
>> have .bg_test to signal when IO is maxed out.
>>
> Actually the lo_segment size is never limited by max_io_size. The server is
> always entitled to send a larger layout than the client asks for.
You miss my point. In your last patch you have
+/* While RFC doesn't limit maximum size of layout, we better limit it ourself. */
+#define PNFSBLK_MAXRSIZE (0x1<<22)
+#define PNFSBLK_MAXWSIZE (0x1<<21)
I don't know what these numbers mean but they kind of look like IO limits
and not segment limits. If I'm wrong then sorry. What are these numbers?
If a client has 1G of dirty pages to write, why not get the full layout
at once? Where does the 4M limit come from?
>
>> - The read segments should be as big as possible (i_size long)
>> - The Write segments should ideally be as big as the Application
>> wants to write to. (Amount of dirty pages at time of nfs-write-out
>> is a very good first approximation).
>>
>> So I guess it is: I hate these patches, to much mess, too little goodness.
> I'm afraid I can't agree with you...
>
Sure you do. You did the hard work and now I'm telling you that you need to do
more work. I'm sorry for that. But I want a solution for me and I think
there is a simple solution that will satisfy both of our needs.
Sorry for that. If I had time I would do it. Only I have harder real BUGS
to fix on my plate.
If you could look into it, that would be very nice. And thank you for working
on this so far. Only the current solution is not optimal and I will need
to continue on it later, if left as is.
> Thanks,
> Tao
>
Thanks
On Tue, 2011-11-29 at 16:24 -0800, Boaz Harrosh wrote:
> On 11/29/2011 03:30 PM, Trond Myklebust wrote:
> > On Tue, 2011-11-29 at 14:58 -0800, Boaz Harrosh wrote:
> >>
> >> In the kind of topologies I'm talking about, a single layoutget every 1GB is
> >> marginal compared to the gain I get in deploying 100s of DSs. I have thousands of
> >> DSs and I want to spread the load evenly. I'm limited by the size of the layout
> >> (device info in the case of files), so I'm limited by the number of DSs I can
> >> have in a layout. For large files these few devices become a hot spot all
> >> the while the rest of the cluster is idle.
> >
> > I call "bullshit" on that whole argument...
> >
> > You've done sod all so far to address the problem of a client managing
>
> sod? I don't know this word?
'sod all' == 'nothing'
it's English slang...
> > layout segments for a '1000 DS' case. Are you expecting that all pNFS
> > object servers out there are going to do that for you? How do I assume
> > that a generic pNFS files server is going to do the same? As far as I
> > know, the spec is completely moot on the whole subject.
> >
>
> What? The whole segments thing is in the generic part of the spec and is not
> at all specific to, or even specified in, the objects and blocks RFCs.
..and it doesn't say _anything_ about how a client is supposed to manage
them in order to maximise efficiency.
> There is no "layout" in the spec, there are only layout segments. Actually
> what we call a layout_segment is, in the spec, simply called a layout.
>
> The client asks for a layout (segment) and gets one. An ~0 length one
> is just a special case. Without layout_get (segment) there is no optional
> pnfs support.
>
> So we are reading two different specs because to me it clearly says
> layout - which is a segment.
>
> Because the way I read it, pNFS is optional in 4.1. But if I'm a
> pNFS client I need to expect layouts (segments).
>
> > IOW: I'm not even remotely interested in your "everyday problems" if
> > there are no "everyday solutions" that actually fit the generic can of
> > spec worms that the pNFS layout segments open.
>
> That I don't understand. What "spec worms that the pNFS layout segments open"
> are you seeing? Because it works pretty simply for me. And I don't see the
> big difference for files. One thing I learned from the past is that when you
> have concerns I should understand them and start to address them, because
> your insights are usually on the money. If you are concerned then there is
> something I should fix.
I'm saying that if I need to manage layouts that deal with >1000 DSes,
then I presumably need a strategy for ensuring that I return/forget
segments that are no longer needed, and I need a strategy for ensuring
that I always hold the segments that I do need; otherwise, I could just
ask for a full-file layout and deal with the 1000 DSes (which is what we
do today)...
My problem is that the spec certainly doesn't give me any guidance as to
such a strategy, and I haven't seen anybody else step up to the plate.
In fact, I strongly suspect that such a strategy is going to be very
application specific.
IOW: I don't accept that a layout-segment based solution is useful
without some form of strategy for telling me which segments to keep and
which to throw out when I start hitting client resource limits. I also
haven't seen any strategy out there for setting loga_length (as opposed
to loga_minlength) in the LAYOUTGET requests: as far as I know that is
going to be heavily application-dependent in the 1000-DS world.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On 11/29/2011 03:01 PM, Trond Myklebust wrote:
> On Tue, 2011-11-29 at 14:40 -0800, Boaz Harrosh wrote:
>
>> The reason Netapp don't see a point is not a factor. I have not seen a netapp
>> server doing a top 100 HPC cluster either.
>
> BTW: Stop adding this "NetApp" insinuation bullshit...
>
> I _strongly_ resent the fact that you appear to think I'm incapable of
> thinking for myself...
>
OK, I take that back, and apologize. Please forgive me. No insult intended.
I got emotional and feared the fireworks we tend to have about the whole
thing. I should have exercised more control.
If anything it is the opposite: I think you are a brilliant guy, and
I admire your thinking, hence my big dilemma.
Heart
On 2011-11-30 21:44, J. Bruce Fields wrote:
> On Tue, Nov 29, 2011 at 04:52:59PM -0800, Marc Eshel wrote:
>> Trond Myklebust <[email protected]> wrote on 11/29/2011 04:37:05
>> PM:
>>>
>>> Peng Tao, bhalevy, Boaz Harrosh, Garth Gibson, Fred Isaman, linux-
>>> nfs, Matt Benjamin
>>>
>>> On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
>>>> You ignored my main point, I was talking about the server side, my
>> point
>>>> was that there is nothing to build on on the serve side since the pNFS
>>
>>>> Linux server is not happening.
>>>> Marc.
>>>
>>> Sorry. I misunderstood your concern. As far as I know, the main problem
>>> there is also one of investment: nobody has stepped up to help Bruce
>>> write a pNFS server.
>>>
>>> I'm less worried about this now than I was earlier, because other open
>>> source efforts are gaining traction (see Ganesha - which is being
>>> sponsored by IBM, and projects such as Tigran's java based pNFS server).
>>> The other point is that we've developed client test-rigs that don't
>>> depend on the availability of a Linux server (newpynfs and the pynfs
>>> based proxy).
>>
>> You got it backward, Ganesha is getting traction precisely because the
>> Linux kernel server is not happening :)
>
> My understanding is the same as Trond's--the reason it's not happening
> is because nobody is making an effort to merge it. What am I missing?
>
Tonian is working on this and we've allocated resources also to help
out on your nfs4.1 todo list.
Benny
> --b.
On Thu, Dec 01, 2011 at 06:14:57AM -0500, J. Bruce Fields wrote:
> On Thu, Dec 01, 2011 at 11:47:05AM +0200, Benny Halevy wrote:
> > On 2011-11-30 21:44, J. Bruce Fields wrote:
> > > My understanding is the same as Trond's--the reason it's not happening
> > > is because nobody is making an effort to merge it. What am I missing?
> > >
> >
> > Tonian is working on this and we've allocated resources also to help
> > out on your nfs4.1 todo list.
>
> Great, thanks; let me know how I can help.
(And, apologies, I shouldn't have said *nobody*: Mi Jinlong has been
doing some good work for a while now. I also appreciate that you've
managed to squeeze in some time for testing and the occasional patch.
And thanks to Lior for the help triaging pynfs issues--I do hope to get
back to that soon. But this code still isn't getting the love it
needs....)
--b.
On Thu, Dec 01, 2011 at 11:47:05AM +0200, Benny Halevy wrote:
> On 2011-11-30 21:44, J. Bruce Fields wrote:
> > On Tue, Nov 29, 2011 at 04:52:59PM -0800, Marc Eshel wrote:
> >> Trond Myklebust <[email protected]> wrote on 11/29/2011 04:37:05
> >> PM:
> >>>
> >>> Peng Tao, bhalevy, Boaz Harrosh, Garth Gibson, Fred Isaman, linux-
> >>> nfs, Matt Benjamin
> >>>
> >>> On Tue, 2011-11-29 at 16:20 -0800, Marc Eshel wrote:
> >>>> You ignored my main point, I was talking about the server side, my
> >> point
> >>>> was that there is nothing to build on on the serve side since the pNFS
> >>
> >>>> Linux server is not happening.
> >>>> Marc.
> >>>
> >>> Sorry. I misunderstood your concern. As far as I know, the main problem
> >>> there is also one of investment: nobody has stepped up to help Bruce
> >>> write a pNFS server.
> >>>
> >>> I'm less worried about this now than I was earlier, because other open
> >>> source efforts are gaining traction (see Ganesha - which is being
> >>> sponsored by IBM, and projects such as Tigran's java based pNFS server).
> >>> The other point is that we've developed client test-rigs that don't
> >>> depend on the availability of a Linux server (newpynfs and the pynfs
> >>> based proxy).
> >>
> >> You got it backward, Ganesha is getting traction precisely because the
> >> Linux kernel server is not happening :)
> >
> > My understanding is the same as Trond's--the reason it's not happening
> > is because nobody is making an effort to merge it. What am I missing?
> >
>
> Tonian is working on this and we've allocated resources also to help
> out on your nfs4.1 todo list.
Great, thanks; let me know how I can help.
--b.