On 06/07/2010 06:24 PM, sfaibish wrote:
> Boaz,
>
> You were mentioning some preliminary performance numbers for NFS4.1 and
> pNFS during the pNFS call a few weeks back. I thought you put them in an
> email but I couldn't find it. Could you re-send it to me, or summarize the
> results in a new email, for comparison with the block layout performance?
> Bruce is also interested, so I CC him as well. Thanks
>
> /Sorin
>
I have not yet published the document. It's stuck behind my lack of talent
for writing and the pNFS bugs du jour.
Basically, all machines:
- connected by a 1 Gbit link.
- All clients doing a dd write of an 8GB file from /dev/zero.
- 3of8 is the special raid-groups arrangement of exofs && objlayout
  where, out of 8 devices, each file is striped over 3 devices in
  round-robin fashion. (*With a small dirty trick)
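A rough reconstruction of the per-client workload described above (a sketch only; the file path, block size, and helper name are assumptions, not from the thread):

```python
import time

def dd_zero_write(path, size_bytes=8 * 1024**3, block=1024**2):
    """Stream zeros to `path` in fixed-size blocks and return MB/s.

    Rough stand-in for `dd if=/dev/zero of=<file> bs=1M count=8192`;
    the path and block size here are assumptions, not from the thread.
    """
    buf = b"\x00" * block
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_bytes // block):
            f.write(buf)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size_bytes / elapsed / 1e6  # MB/s
```

Each client would run one such writer against its own file on the pNFS mount.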
[single client]
1 osd  -  40 MB/s
2 osds -  80 MB/s
4 osds - 114 MB/s (saturation point of the 1 Gbit link)
8 osds - 114 MB/s

[2 clients 8of8 osds]
226 MB/s

[4 clients 8of8 osds]
263 MB/s

[8 clients 8of8 osds]
252 MB/s

[1 client 3of8 osds]
114 MB/s

[2 clients 3of8 osds *]
226 MB/s

[4 clients 3of8 osds *]
417 MB/s

[8 clients 3of8 osds]
405 MB/s
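A quick sanity check of these figures (the numbers are the ones reported above; 114 MB/s is the observed single-link saturation point):

```python
# No client can push more than its own 1 Gbit link (~114 MB/s observed above).
link_saturation = 114  # MB/s, single-client saturation point reported above

results_3of8 = {1: 114, 2: 226, 4: 417, 8: 405}  # clients -> aggregate MB/s

for clients, aggregate in results_3of8.items():
    per_client = aggregate / clients
    # Every reported aggregate stays within clients * link capacity.
    assert per_client <= link_saturation, (clients, per_client)
    print(f"{clients} client(s): {aggregate} MB/s total, "
          f"{per_client:.1f} MB/s per client")
```

Note how per-client throughput drops from 114 to about 51 MB/s as client count grows, even in the best (3of8) arrangement.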
Boaz
On 06/07/2010 09:49 PM, J. Bruce Fields wrote:
>>
>> It's a known problem with a network storage cluster. What happens is
>> that with 8of8 all the clients exercise all of the nodes at the same
>> time, so they are clashing on the network.
>
> OK, so if two clients are both trying to send a stripe of data to the
> same OSD at the same time, absent a switch that could somehow
> afford to queue up a full stripe-unit's worth of data, packets get lost?
>
It's TCP, so they don't get lost per se; they just get queued up, and then
TCP ramp-up and all that kicks in, you know.
We use a 64KB stripe unit with, say, a RAID group of 4-8 devices; that's
256KB-512KB per full stripe.
I don't think a network buffer that big would help at all. It would just delay
everything more. The best approach is a sound statistical network strategy that
lets the system even out overall. (Or not ...)
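Spelling out that full-stripe arithmetic (64KB stripe unit times the RAID group width):

```python
# Full-stripe size = stripe unit * RAID group width.
stripe_unit = 64 * 1024  # 64 KB per device per stripe, as stated above

for width in (4, 8):
    print(f"group of {width}: {stripe_unit * width // 1024} KB per full stripe")
# group of 4 -> 256 KB; group of 8 -> 512 KB
```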
> (Also, out of curiosity: do you know of any papers or documentation that
> describe that problem in more detail?)
>
Personally, I'm privileged to learn from the best here at Panasas.
CC: Brent, can you recommend to Bruce some good papers about RAID
groups and network SAN strategies?
> --b.
Boaz
On 2010-06-07 21:49, J. Bruce Fields wrote:
> On Mon, Jun 07, 2010 at 09:41:29PM +0300, Boaz Harrosh wrote:
>> On 06/07/2010 09:29 PM, J. Bruce Fields wrote:
>>>>> On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
>>>>>> I have not yet published the document. It's stuck behind my lack of
>>>>>> talent for writing and the pNFS bugs du jour.
>>>
>>> Untalented writing we can fix, as long as the details are there!
>>>
>>>>>>
>>>>>> Basically all machines:
>>>>>> - connected by a 1 GBit link.
>>>>>> - All clients doing a dd write of 8GB file from /dev/zero
>>>>>> - 3of8 is the special raid-groups arrangement of exofs && objlayout
>>>>>> where out of 8 devices each file is striped over 3 devices in a
>>>>>> round robin fashion. (*With a small dirty trick)
>>>
>>> Random stupid questions:
>>>
>>> - why do you think the 3of8 arrangement is scaling better than
>>> the 8of8?
>>
>> It's a known problem with a network storage cluster. What happens is
>> that with 8of8 all the clients exercise all of the nodes at the same
>> time, so they are clashing on the network.
>
> OK, so if two clients are both trying to send a stripe of data to the
> same OSD at the same time, absent a switch that could somehow
> afford to queue up a full stripe-unit's worth of data, packets get lost?
>
> (Also, out of curiosity: do you know of any papers or documentation that
> describe that problem in more detail?)
>
A good place to start would be
http://www.pdl.cmu.edu/Incast/
Benny
> --b.
>> On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
>>> I have not yet published the document. It's stuck behind my lack of
>>> talent for writing and the pNFS bugs du jour.
Untalented writing we can fix, as long as the details are there!
>>>
>>> Basically all machines:
>>> - connected by a 1 GBit link.
>>> - All clients doing a dd write of 8GB file from /dev/zero
>>> - 3of8 is the special raid-groups arrangement of exofs && objlayout
>>> where out of 8 devices each file is striped over 3 devices in a
>>> round robin fashion. (*With a small dirty trick)
Random stupid questions:
- why do you think the 3of8 arrangement is scaling better than
the 8of8?
- Have you tried any other workloads? (Perfectly reasonable
that simple write throughput would be the first thing to
check--I'm just curious.)
>>>
>>
>> - All tests over an *empty* filesystem.
>>
>>> [single client]
>>> 1 - osds 40MB
>>> 2 - osds 80MB
>>> 4 - osds 114MB (saturation point of the 1 Gbit link)
>>> 8 - osds 114MB
>>>
>>> [2 clients 8of8 osds]
>>> 226 MBs
>>>
>>> [4 clients 8of8 osds]
>>> 263 MBs
>>>
>>> [8 clients 8of8 osds]
>>> 252 MBs
>>>
>>> [1 clients 3of8 osds]
>>> 114 MBs
>>>
>>> [2 clients 3of8 osds *]
>>> 226 MBs
>>>
>>> [4 clients 3of8 osds *]
>>> 417 MBs
If each osd has a single gigabit interface, and you're striping to 3 of
them, isn't that 417/3 == 139 MB/s each?
(Oh, I see: you must be writing to a different file from each client,
hence you are using all osd's even if each client is only using 3?)
--b.
>>>
>>> [8 clients 3of8 osds]
>>> 405 MBs
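Bruce's 417/3 check, done explicitly (a small sketch; 114 MB/s is the single-link saturation figure from the first results):

```python
# If all 4 clients striped to the same 3 OSDs, each OSD would have to
# absorb 417/3 MB/s, well above what a single 1 Gbit link can carry.
aggregate = 417      # MB/s, the 4-client 3of8 result above
group_width = 3
per_osd_if_shared = aggregate / group_width
print(f"{per_osd_if_shared:.0f} MB/s per OSD would be required")  # 139 MB/s
assert per_osd_if_shared > 114  # impossible -> clients must hit different OSDs
```

Which is exactly the conclusion in the parenthetical: each client writes its own file, so between them the clients spread across all 8 OSDs.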
Problem solved; I sent Bruce 2 relevant papers from CMU and FAST 2009.
/Sorin
On Tue, 08 Jun 2010 02:54:53 -0400, Benny Halevy <[email protected]>
wrote:
> On 2010-06-07 21:49, J. Bruce Fields wrote:
>> On Mon, Jun 07, 2010 at 09:41:29PM +0300, Boaz Harrosh wrote:
>>> On 06/07/2010 09:29 PM, J. Bruce Fields wrote:
>>>>>> On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
>>>>>>> I have not yet published the document. It's stuck behind my
>>>>>>> lack of talent for writing and the pNFS bugs du jour.
>>>>
>>>> Untalented writing we can fix, as long as the details are there!
>>>>
>>>>>>>
>>>>>>> Basically all machines:
>>>>>>> - connected by a 1 GBit link.
>>>>>>> - All clients doing a dd write of 8GB file from /dev/zero
>>>>>>> - 3of8 is the special raid-groups arrangement of exofs && objlayout
>>>>>>> where out of 8 devices each file is striped over 3 devices in a
>>>>>>> round robin fashion. (*With a small dirty trick)
>>>>
>>>> Random stupid questions:
>>>>
>>>> - why do you think the 3of8 arrangement is scaling better than
>>>> the 8of8?
>>>
>>> It's a known problem with a network storage cluster. What happens is
>>> that with 8of8 all the clients exercise all of the nodes at the same
>>> time, so they are clashing on the network.
>>
>> OK, so if two clients are both trying to send a stripe of data to the
>> same OSD at the same time, absent a switch that could somehow
>> afford to queue up a full stripe-unit's worth of data, packets get lost?
>>
>> (Also, out of curiosity: do you know of any papers or documentation that
>> describe that problem in more detail?)
>>
>
> A good place to start would be
> http://www.pdl.cmu.edu/Incast/
>
> Benny
>
>> --b.
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : [email protected]
On Mon, Jun 07, 2010 at 09:41:29PM +0300, Boaz Harrosh wrote:
> On 06/07/2010 09:29 PM, J. Bruce Fields wrote:
> >>> On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
> >>>> I have not yet published the document. It's stuck behind my lack of
> >>>> talent for writing and the pNFS bugs du jour.
> >
> > Untalented writing we can fix, as long as the details are there!
> >
> >>>>
> >>>> Basically all machines:
> >>>> - connected by a 1 GBit link.
> >>>> - All clients doing a dd write of 8GB file from /dev/zero
> >>>> - 3of8 is the special raid-groups arrangement of exofs && objlayout
> >>>> where out of 8 devices each file is striped over 3 devices in a
> >>>> round robin fashion. (*With a small dirty trick)
> >
> > Random stupid questions:
> >
> > - why do you think the 3of8 arrangement is scaling better than
> > the 8of8?
>
> It's a known problem with a network storage cluster. What happens is
> that with 8of8 all the clients exercise all of the nodes at the same
> time, so they are clashing on the network.
OK, so if two clients are both trying to send a stripe of data to the
same OSD at the same time, absent a switch that could somehow
afford to queue up a full stripe-unit's worth of data, packets get lost?
(Also, out of curiosity: do you know of any papers or documentation that
describe that problem in more detail?)
--b.
On 06/07/2010 09:29 PM, J. Bruce Fields wrote:
>>> On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
>>>> I have not yet published the document. It's stuck behind my lack of
>>>> talent for writing and the pNFS bugs du jour.
>
> Untalented writing we can fix, as long as the details are there!
>
>>>>
>>>> Basically all machines:
>>>> - connected by a 1 GBit link.
>>>> - All clients doing a dd write of 8GB file from /dev/zero
>>>> - 3of8 is the special raid-groups arrangement of exofs && objlayout
>>>> where out of 8 devices each file is striped over 3 devices in a
>>>> round robin fashion. (*With a small dirty trick)
>
> Random stupid questions:
>
> - why do you think the 3of8 arrangement is scaling better than
> the 8of8?
It's a known problem with a network storage cluster. What happens is
that with 8of8 all the clients exercise all of the nodes at the same
time, so they are clashing on the network.
With 3of8 each node can still saturate its link (3 was chosen carefully, based
on the first test), and different clients talk to different OSDs, so there is a
better chance of multiple client-OSD pairs each running at 1 Gbit at the same time.
(The dirty trick I did was to insert dummy files so that the 4-client test would
exercise all 8 devices. Otherwise the stupid exofs round-robin algorithm would
have exercised only 4+3 devices.)
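One hypothetical model of the round-robin placement being described (the actual exofs layout algorithm may differ; this sketch is for illustration only):

```python
# Hypothetical round-robin placement: each successive file takes the next
# 3 devices out of 8, wrapping around. Not the real exofs algorithm.
NUM_DEVICES = 8
GROUP_WIDTH = 3

def group_for_file(file_index):
    start = (file_index * GROUP_WIDTH) % NUM_DEVICES
    return [(start + j) % NUM_DEVICES for j in range(GROUP_WIDTH)]

for i in range(4):  # 4 clients, one file each
    print(f"file {i}: devices {group_for_file(i)}")
# Coverage is uneven: some devices land in two groups while others land in
# only one, which is the kind of imbalance the dummy-file trick evens out.
```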
> - Have you tried any other workloads? (Perfectly reasonable
> that simple write throughput would be the first thing to
> check--I'm just curious.)
Never got to it; busy with Bakeathon preparations. I would like to very much.
Thanks
Boaz
>
>>>>
>>>
>>> - All tests over an *empty* filesystem.
>>>
>>>> [single client]
>>>> 1 - osds 40MB
>>>> 2 - osds 80MB
>>>> 4 - osds 114MB (saturation point of the 1 Gbit link)
>>>> 8 - osds 114MB
>>>>
>>>> [2 clients 8of8 osds]
>>>> 226 MBs
>>>>
>>>> [4 clients 8of8 osds]
>>>> 263 MBs
>>>>
>>>> [8 clients 8of8 osds]
>>>> 252 MBs
>>>>
>>>> [1 clients 3of8 osds]
>>>> 114 MBs
>>>>
>>>> [2 clients 3of8 osds *]
>>>> 226 MBs
>>>>
>>>> [4 clients 3of8 osds *]
>>>> 417 MBs
>
> If each osd has a single gigabit interface, and you're striping to 3 of
> them, isn't that 417/3 == 139 MB/s each?
>
> (Oh, I see: you must be writing to a different file from each client,
> hence you are using all osd's even if each client is only using 3?)
>
> --b.
>
>>>>
>>>> [8 clients 3of8 osds]
>>>> 405 MBs
On 06/07/2010 09:29 PM, J. Bruce Fields wrote:
>>>> [4 clients 3of8 osds *]
>>>> 417 MBs
>
> If each osd has a single gigabit interface, and you're striping to 3 of
> them, isn't that 417/3 == 139 MB/s each?
>
> (Oh, I see: you must be writing to a different file from each client,
> hence you are using all osd's even if each client is only using 3?)
>
Right, and that little trick from the previous email ;-)
Boaz
> --b.
>
>>>>
>>>> [8 clients 3of8 osds]
>>>> 405 MBs
On 06/07/2010 07:07 PM, Boaz Harrosh wrote:
> On 06/07/2010 06:24 PM, sfaibish wrote:
>> Boaz,
>>
>> You were mentioning some preliminary performance numbers for NFS4.1 and
>> pNFS during the pNFS call a few weeks back. I thought you put them in an
>> email but I couldn't find it. Could you re-send it to me, or summarize the
>> results in a new email, for comparison with the block layout performance?
>> Bruce is also interested, so I CC him as well. Thanks
>>
>> /Sorin
>>
>
> I have not yet published the document. It's stuck behind my lack of talent
> for writing and the pNFS bugs du jour.
>
> Basically all machines:
> - connected by a 1 GBit link.
> - All clients doing a dd write of 8GB file from /dev/zero
> - 3of8 is the special raid-groups arrangement of exofs && objlayout
> where out of 8 devices each file is striped over 3 devices in a
> round robin fashion. (*With a small dirty trick)
>
- All tests over an *empty* filesystem.
> [single client]
> 1 - osds 40MB
> 2 - osds 80MB
> 4 - osds 114MB (saturation point of the 1 Gbit link)
> 8 - osds 114MB
>
> [2 clients 8of8 osds]
> 226 MBs
>
> [4 clients 8of8 osds]
> 263 MBs
>
> [8 clients 8of8 osds]
> 252 MBs
>
> [1 clients 3of8 osds]
> 114 MBs
>
> [2 clients 3of8 osds *]
> 226 MBs
>
> [4 clients 3of8 osds *]
> 417 MBs
>
> [8 clients 3of8 osds]
> 405 MBs
>
> Boaz
>
Thanks.
/Sorin