2014-11-20 18:24:35

by Shirley Ma

Subject: NFSoRDMA developers bi-weekly meeting minutes (11/20)

Attendees:

Jeff Becker (NASA)
Yan Burman (Mellanox)
Wendy Cheng (Intel)
Rupert Dance (Soft Forge)
Steve Dickson (Red Hat)
Chuck Lever (Oracle)
Doug Ledford (Red Hat)
Shirley Ma (Oracle)
Sachin Prabhu (Red Hat)
Devesh Sharma (Emulex)
Anna Schumaker (NetApp)
Steve Wise (OpenGridComputing, Chelsio)

Moderator:
Shirley Ma (Oracle)

The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA development and test efforts across different organizations, in order to speed up NFSoRDMA upstream kernel work and the development of NFSoRDMA diagnostic/debugging tools. Hopefully the quality of NFSoRDMA upstream patches can be improved by having them tested across a quorum of HW vendors.

Today's meeting notes:

NFSoRDMA performance:
---------------------
Even though NFSoRDMA performance seems better than IPoIB-cm, the gap between what the IB protocol can provide and what NFS (over RDMA or IPoIB-cm) can achieve is still large at small I/O block sizes (the focus is on 8K I/O size for database workloads). Even at large I/O block sizes (128K and above), NFS performance is not comparable to RDMA microbenchmarks. We are focusing the effort on finding the root cause. Several experimental approaches have been tried to improve NFSoRDMA performance.

Yan saw the NFS server use RDMA WRITE for small messages (less than 100 bytes), where a plain RDMA SEND (post_send) should have been used instead.

1. performance experimental investigation: (Shirley, Chuck, Yan)
-- multiple QPs support:
Created multiple subnets with different partition keys and different NFS client mount points to stress single-link performance. Multi-threaded iozone with 8K direct I/O showed around a 17% improvement, but there is still a big gap to link speed.

-- completion vector load balancing
Splitting the send and receive completion queue interrupts onto different CPUs did not help performance. A patch was then created to distribute interrupts among the available CPUs for different QPs, with send and receive completions sharing the same completion vector; multi-threaded iozone with 8K direct I/O showed a 10% performance improvement (a minimal sketch of this and of the batching ideas below appears after item 3).

Yan shared iser performance enhancement ideas:
-- batch recv packet processing
-- batch completion processing, not signaling every completion
-- per CPU connection, cq
iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS. 32K could reach 1.8 million IOPS

-- increasing RPC credit limit from 32 to 64
iozone 8K DIO results don't show any gain, which might indicate that we need to look at the general NFS I/O stack.

-- increasing work queue priority to reduce latency
NFS uses a workqueue rather than a tasklet since the code runs in a can-sleep context; changing the workqueue flags to WQ_HIGHPRI | WQ_CPU_INTENSIVE did help reduce latency when the system is under heavy workloads (also sketched after item 3 below).

-- lock contention
perf top does show lock contention in the top-five list for both the NFS client and the NFS server. A more fine-grained investigation of the lock contention is needed.

-- scheduling latency
I/O scheduling was developed for high-latency devices, so there might be some room for improvement in I/O scheduling.

-- wsize, rsize
Chuck is looking at increasing wsize and rsize to 1MB.

2. performance analysis tools to use:
-- perf, lockstat, ftrace, mountstats, nfsiostats...

3. performance test tools:
-- iozone, fio
-- direct IO, cached IO
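
As a concrete illustration of the completion vector load balancing and
batched completion processing ideas in item 1 above, here is a minimal
user-space libibverbs sketch (the in-kernel code would use the
equivalent ib_* verbs; the number of QPs and the queue depths are
arbitrary):

#include <stdio.h>
#include <infiniband/verbs.h>

#define NUM_QPS  4
#define CQ_DEPTH 256
#define BATCH    16	/* reap up to 16 completions per poll */

int main(void)
{
	struct ibv_device **dev_list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_cq *cq[NUM_QPS];
	struct ibv_wc wc[BATCH];
	int i, n, vectors;

	if (!dev_list || !dev_list[0])
		return 1;
	ctx = ibv_open_device(dev_list[0]);
	if (!ctx)
		return 1;

	/* One CQ per QP, each bound to a different completion vector so
	 * completion interrupts are spread across CPUs instead of all
	 * landing on one core. */
	vectors = ctx->num_comp_vectors ? ctx->num_comp_vectors : 1;
	for (i = 0; i < NUM_QPS; i++) {
		cq[i] = ibv_create_cq(ctx, CQ_DEPTH, NULL, NULL, i % vectors);
		if (!cq[i])
			return 1;
	}

	/* Batched completion processing: handle up to BATCH completions
	 * per ibv_poll_cq() call rather than one at a time. */
	for (i = 0; i < NUM_QPS; i++) {
		n = ibv_poll_cq(cq[i], BATCH, wc);
		printf("CQ %d: reaped %d completions\n", i, n);
	}

	for (i = 0; i < NUM_QPS; i++)
		ibv_destroy_cq(cq[i]);
	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}

The send-side complement of batching is selective signaling: create the
QP with sq_sig_all disabled and set IBV_SEND_SIGNALED on only every Nth
send work request, so only a fraction of sends generate completions.

The WQ_HIGHPRI | WQ_CPU_INTENSIVE experiment above looks roughly like
the following (a sketch only; the workqueue name is illustrative, not
the identifier used in the actual patch):

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int __init example_init(void)
{
	/* High-priority, CPU-intensive workqueue so reply processing is
	 * not starved when the system is under heavy load. */
	example_wq = alloc_workqueue("nfsrdma_example",
				     WQ_HIGHPRI | WQ_CPU_INTENSIVE, 0);
	return example_wq ? 0 : -ENOMEM;
}

static void __exit example_exit(void)
{
	destroy_workqueue(example_wq);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");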

Next step for performance analysis:
1. Shirley will collect performance data in the NFS I/O layer to see whether there are any bottlenecks there.
2. Someone needs to look at the NFS server's handling of the small RDMA messages Yan has seen.

Feel free to reply here with anything that is missing. See you on 12/4.

12/04/2014
@7:30am PDT
@8:30am MDT
@9:30am CDT
@10:30am EDT
@Bangalore @8:00pm
@Israel @5:30pm

Duration: 1 hour

Call-in number:
Israel: +972 37219638
Bangalore: +91 8039890080 (180030109800)
France Colombes +33 1 5760 2222 +33 176728936
US: 8666824770, 408-7744073

Conference Code: 2308833
Passcode: 63767362 (it's NFSoRDMA, in case you couldn't remember)

Thanks everyone for joining the call and providing valuable inputs/work to the community to make NFSoRDMA better.

Cheers,
Shirley


2014-11-20 18:47:11

by Chuck Lever III

Subject: Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)


On Nov 20, 2014, at 1:24 PM, Shirley Ma <[email protected]> wrote:

> Yan saw the NFS server use RDMA WRITE for small messages (less than 100 bytes), where a plain RDMA SEND (post_send) should have been used instead.

This is an artifact of how NFS/RDMA works.

The client provides a registered area for the server to write
into if an RPC reply is larger than the small pre-posted
buffers that are normally used.

Most of the time, each RPC reply is small enough to use RDMA
SEND, and the server can convey the RPC/RDMA header and the
RPC reply in a single SEND operation.

If the reply is large, the server conveys the RPC/RDMA header
via RDMA send, and the RPC reply via an RDMA WRITE into the
client's registered buffer.
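
In rough pseudocode, that choice looks something like the sketch below
(an illustration only, not the actual svcrdma code; all of the names
are hypothetical stand-ins):

#include <stdio.h>

struct conn  { size_t inline_size; int client_offered_write_chunk; };
struct reply { size_t len; };

/* Stand-ins for posting the actual verbs work requests. */
static void do_rdma_send(const char *what)  { printf("RDMA SEND:  %s\n", what); }
static void do_rdma_write(const char *what) { printf("RDMA WRITE: %s\n", what); }

static void send_rpc_reply(const struct conn *c, const struct reply *r)
{
	if (r->len <= c->inline_size || !c->client_offered_write_chunk) {
		/* Small reply (or no write chunk offered): RPC/RDMA header
		 * and RPC reply fit in a single SEND. */
		do_rdma_send("RPC/RDMA header + RPC reply");
		return;
	}
	/* Large reply: reply body via RDMA WRITE into the client's
	 * registered buffer, then the RPC/RDMA header via RDMA SEND. */
	do_rdma_write("RPC reply into client's registered buffer");
	do_rdma_send("RPC/RDMA header only");
}

int main(void)
{
	struct conn c = { 1024, 1 };
	struct reply small = { 100 }, large = { 128 * 1024 };

	send_rpc_reply(&c, &small);	/* single SEND */
	send_rpc_reply(&c, &large);	/* WRITE + SEND */
	return 0;
}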

Solaris server chooses RDMA SEND in nearly every case.

Linux server chooses RDMA SEND then RDMA WRITE whenever
the client offers that choice.

Originally, it was felt that doing the RDMA WRITE is better
for the client because the client doesn't have to copy the
RPC header from the RDMA receive buffer back into rq_rcv_buf.
Note that the RPC header is generally just a few hundred
bytes.

Several people have claimed that RDMA WRITE for small I/O
is relatively expensive and should be avoided. It's also
expensive for the client to register and deregister the
receive buffer for the RDMA WRITE if the server doesn't
use it.

I've explored changing the client to offer no registered
buffer if it knows the RPC reply will be small, thus
forcing the server to use RDMA SEND where it's safe.

Solaris server worked fine. Of course, it already works
this way.

Linux server showed some data and metadata corruption on
complex workloads like kernel builds. There's a bug in
there somewhere that will need to be addressed before we
can change the client behavior.

The improvement was consistent, but under ten microseconds
per RPC with FRWR (more with FMR because deregistering the
buffer takes longer and is synchronous with RPC execution).

At this stage, there are bigger problems to be addressed,
so this is not a top priority.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2014-11-20 20:15:56

by Cheng, Wendy

Subject: RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)

> -----Original Message-----
> From: Shirley Ma [mailto:[email protected]]
> Sent: Thursday, November 20, 2014 10:24 AM
>
> ....
> iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS. 32K could
> reach 1.8 million IOPS
>

How did the iSER data get measured? Was the measurement done at the iSER layer, the block layer, or the filesystem layer?

-- Wendy



2014-11-20 22:00:25

by Shirley Ma

Subject: Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)


On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
>> -----Original Message-----
>> From: Shirley Ma [mailto:[email protected]]
>> Sent: Thursday, November 20, 2014 10:24 AM
>>
>> ....
>> iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS. 32K could
>> reach 1.8 million IOPS
>>
>
> How did the iSER data get measured? Was the measurement done at the iSER layer, the block layer, or the filesystem layer?

Here is a link describing how to set up iSER and measure its performance:
http://community.mellanox.com/docs/DOC-1483

2014-11-24 11:59:23

by Yan Burman

Subject: RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)



> -----Original Message-----
> From: Shirley Ma [mailto:[email protected]]
> Sent: Friday, November 21, 2014 00:00
> Subject: Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)
>
>
> On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
> >> -----Original Message-----
> >> From: Shirley Ma [mailto:[email protected]]
> >> Sent: Thursday, November 20, 2014 10:24 AM
> >>
> >> ....
> >> iser 8K could reach 4.5GB/s in 56Gb/s link speed, 1.5 million IOPS.
> >> 32K could reach 1.8 million IOPS
> >>
> >
> > How did the iSER data get measured? Was the measurement done at the iSER layer,
> > the block layer, or the filesystem layer?
>
> Here is a link describing how to set up iSER and measure its performance:
> http://community.mellanox.com/docs/DOC-1483

The actual numbers are (there seems to have been some misunderstanding in the meeting minutes):
For a single LUN/session in iSER, on a ConnectX-3 FDR link with an 8-core 2.6GHz Xeon:
8K block size reaches 2.5GB/s
Somewhere between 16K and 32K block size, iSER reaches 5.5GB/s, which is almost line rate
256K block size gives 5.7GB/s

With 16 sessions, it is possible to reach 1.7M IOPS with 1K block size and about 600K IOPS with 8K block size.

Note that these numbers are for the SCST iSER implementation, and there are more tunings and enhancements that can be applied to further improve performance.

Another issue that came up is the benefit of RDMA WRITE vs. SEND.
To check that, you can compare ib_send_lat and ib_write_lat and see the latencies for different block sizes.