On Friday May 28, [email protected] wrote:
> G'day,
>
> After poking around with my previously posted patch on various
> workloads and irq configurations, I'm convinced that the fairness
> issues I mentioned earlier are entirely due to interactions between the
> hardware, the tg3 driver, and the Linux network device infrastructure,
> rather than anything intrinsic in the patch.
>
> Also, I've fixed the locking problem Trond identified.
>
> So I'm submitting this for real.
>
Thanks...
It looks good and your performance figures are certainly encouraging.
I have two concerns, that can possibly be allayed.
Firstly, the reply to any request is going to go out the socket that
the request came in on. However some network configurations can have
requests arrive on one interface that need to be replied to on a
different interface(*). Does setting sk_bound_dev_if cause sent packets
to go out that interface (I'm not familiar enough with the networking
code to be sure)? If it does, then this will cause problems in those
network configurations.
Secondly, while the need for multiple udp sockets is clear, it isn't
clear that they should be per-device.
Other options are per-CPU and per-thread. Alternately there could
simply be a pool of free sockets (or a per-cpu pool).
Having multiple sockets that are not bound differently is not
currently possible without setting sk_reuse, and doing this allows
user programs to steal NFS requests.
However if we could create a socket that was bound to an address only
for sending and not for receiving, we could use a pool of such sockets
for sending.
This might be as easy as adding a "sk_norecv" field to struct sock,
and skipping the sk->sk_prot->get_port call in inet_bind if it is set.
Then all incoming requests would arrive on the one udp socket (there
is very little contention between incoming packets on the one socket),
and reply could go out one of the sk_norecv ports.
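Very roughly, the change I have in mind (an untested sketch; the
field name and the exact spot in inet_bind are guesses):

	/* somewhere in struct sock */
	unsigned char		sk_norecv;	/* socket is for sending only */

	/* in inet_bind(), where the local port is normally reserved */
	if (!sk->sk_norecv) {
		if (sk->sk_prot->get_port(sk, snum)) {
			err = -EADDRINUSE;
			goto out_release_sock;	/* existing error path */
		}
	}
	/* with sk_norecv set the socket is never hashed for receive,
	 * so incoming datagrams can never match it */
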
Does this make sense? Would you be willing to try it?
NeilBrown
(*) Server S has two interfaces, A and B, on two subnets.
Client C has an interface on a third subnet. All three subnets
are routed by a single router R.
S has a default route out A.
C sends a request to interface B. The reply has to go out interface
A.
(I have a network like this and had to fight with a glibc bug which
forced RPC replies out the same interface that the request came
from).
On Fri, May 28, 2004 at 03:14:10PM +1000, Neil Brown wrote:
> I have two concerns, that can possibly be allayed.
>
> Firstly, the reply to any request is going to go out the socket that
> the request came in on. However some network configurations can have
> requests arrive on one interface that need to be replied to on a
> different interface(*). Does setting sk_bound_dev_if cause sent packets
> to go out that interface (I'm not familiar enough with the networking
> code to be sure)? If it does, then this will cause problems in those
> network configurations.
As near as I can tell from inspecting the (amazingly convoluted)
routing code and without running a test, the output routing algorithm
is short circuited when the socket is bound to a device, and will
send the packet out the bound interface regardless. This comment
by Alexei K in ip_route_output_slow() explains:
> [...] When [output interface] is specified, routing
> tables are looked up with only one purpose:
> to catch if destination is gatewayed, rather than
> direct.
So, yes we could have a problem if the routing is asymmetric.
A similar issue (which I haven't tested either) is what happens
with virtual interfaces like tunnels or bonding.
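If I get a chance I'll check it with a quick userspace hack along
these lines (SO_BINDTODEVICE just sets sk_bound_dev_if): bind a UDP
socket to one interface, send to an address whose normal route goes
out a different interface, and watch both interfaces with tcpdump to
see where the packet actually leaves.  The interface name and
addresses below are placeholders for a setup like your footnote, not
anything real.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	const char *dev = "eth1";		/* "interface B" */
	struct sockaddr_in dst;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* needs root; this is what sets sk_bound_dev_if */
	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
				 dev, strlen(dev) + 1) < 0) {
		perror("socket/SO_BINDTODEVICE");
		return 1;
	}

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);		/* discard port, contents don't matter */
	inet_pton(AF_INET, "10.0.3.1", &dst.sin_addr);	/* "client C" */

	if (sendto(fd, "ping", 4, 0,
		   (struct sockaddr *)&dst, sizeof(dst)) < 0)
		perror("sendto");
	close(fd);
	return 0;
}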
> Secondly, while the need for multiple udp sockets is clear,
Agreed.
> it isn't
> clear that they should be per-device.
> Other options are per-CPU and per-thread. Alternately there could
> simply be a pool of free sockets (or a per-cpu pool).
Ok, there are a number of separate issues here.
Firstly, my performance numbers show that the current serialisation of
svc_sendto() gives about 1.5 NICs' worth of performance per socket, so
whatever method governs the number of sockets needs to ensure that
the number of sockets grows at least as fast as the number of NICs.
On at least some of the SGI hardware the number of NICs can grow
faster than the number of CPUs. For example, the minimal Altix
3000 has 4 CPUs and can have 10 NICs. Similarly, a good building
block for scalable NFS servers is an Altix 350 with 2 CPUs and 4 NICs
(we can't do this yet due to tg3 driver limitations). So this makes
per-CPU sockets unattractive.
Secondly, for the traffic levels I'm trying to reach I need lots of
nfsd threads. I haven't done the testing to find the exact number,
but it's somewhere above 32. I run with 128 threads to make
sure. If we had per-thread sockets that would be a *lot* of sockets.
Under many loads an nfsd thread spends most of its time waiting for
disk IO, and having its own socket would just be wasteful.
Thirdly, where there are enough CPUs I see significant performance
advantage when each NIC directs irqs to a separate dedicated CPU.
In this case having one socket per NIC will mean all the cachelines
for that socket will tend to stay on the interrupt CPU (currently this
doesn't happen because of the way the tg3 driver handles interrupts,
but that will change).
What all of the above means is that I think having one socket per
NIC is very close to the right scaling ratio.
What I'm not sure about is the precise way in which multiple
sockets should be achieved. Using device-bound sockets just seemed
like a really easy way (read: no changes to the network stack) to get
exactly the right scaling. Having a global (or per NUMA node, say)
pool of sockets which scaled by the number of NICs would be fine too,
assuming it could be made to work and handle the routing corner cases.
> Having multiple sockets that are not bound differently is not
> currently possible without setting sk_reuse, and doing this allows
> user programs to steal NFS requests.
Yes, this was another good thing about using device-bound sockets:
it doesn't open that security hole.
> However if we could create a socket that was bound to an address only
> for sending and not for receiving, we could use a pool of such sockets
> for sending.
Interesting idea. Alternatively, we could use the single UDP socket
for receive as now, and a pool of connected UDP sockets for sending.
That should work without needing to modify the network stack.
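Roughly, creating one such send socket from inside the server might
look like the sketch below (untested, the function name is invented,
and it assumes the usual svcsock.c context):

	static int svc_create_send_sock(struct sockaddr_in *client,
					struct socket **sockp)
	{
		struct socket *sock;
		int err;

		err = sock_create(PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);
		if (err < 0)
			return err;

		/* connect() autobinds an ephemeral local port and fixes
		 * the destination, so the stack can cache the route and
		 * we can send on this socket without serialising on the
		 * main receive socket */
		err = sock->ops->connect(sock, (struct sockaddr *)client,
					 sizeof(*client), 0);
		if (err < 0) {
			sock_release(sock);
			return err;
		}
		*sockp = sock;
		return 0;
	}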
> This might be as easy as adding a "sk_norecv" field to struct sock,
> and skipping the sk->sk_prot->get_port call in inet_bind if it is set.
> Then all incoming requests would arrive on the one udp socket (there
> is very little contention between incoming packets on the one socket),
Sure, the limit is the send path.
> and reply could go out one of the sk_norecv ports.
Aha.
> Does this make sense? Would you be willing to try it?
I think it's an intriguing idea and I'll try it as soon as I can.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Neil Brown wrote:
>
> I have two concerns, that can possibly be allayed.
>
> Firstly, the reply to any request is going to go out the socket that
> the request came in on. However some network configurations can have
> requests arrive on one interface that need to be replied to on a
> different interface(*).
Neil,
Some misc data: It is my understanding that Network Appliance's Data ONTAP is
wired up to send replies out the same interface the request came in on. This
solves some problems of managing systems connected via multiple network paths.
I realize it might also create other problems.
eric
On Tuesday June 1, [email protected] wrote:
> Neil Brown wrote:
> >
> > I have two concerns, that can possibly be allayed.
> >
> > Firstly, the reply to any request is going to go out the socket that
> > the request came in on. However some network configurations can have
> > requests arrive on one interface that need to be replied to on a
> > different interface(*).
>
> Neil,
>
> Some misc data: It is my understanding that 'networkappliance data ontap' is
> wired up to send replies out the same interface the request came in on. This
> solves some problems of managing systems connected via multiple network paths.
> I realize it might also create other problems..
>
> eric
Interesting. Thanks.
It may well be reasonable to have a mode of operation where all
replies go out the interfaces that the requests come in on, and which
utilises the simplification to increase performance.
But it would have to be optional and non-default, and I would want to
have a better understanding of the issues, and whether it is the best
way to improve performance.
NeilBrown
On Tue, Jun 01, 2004 at 10:22:55AM -0600, Eric Whiting wrote:
> Neil Brown wrote:
> >
> > I have two concerns, that can possibly be allayed.
> >
> > Firstly, the reply to any request is going to go out the socket that
> > the request came in on. However some network configurations can have
> > requests arrive on one interface that need to be replied to on a
> > different interface(*).
>
> Neil,
>
> Some misc data: It is my understanding that 'networkappliance data ontap' is
> wired up to send replies out the same interface the request came in on. This
> solves some problems of managing systems connected via multiple network paths.
> I realize it might also create other problems..
Linux is a general-purpose OS where users have the expectation that
networking behaves in certain ways. ONTAP drives an appliance whose
purpose is to run file sharing protocols. This means they can make
network level compromises and optimisations that we can't.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Fri, May 28, 2004 at 03:14:10PM +1000, Neil Brown wrote:
> However if we could create a socket that was bound to an address only
> for sending and not for receiving, we could use a pool of such sockets
> for sending.
> This might be as easy as adding a "sk_norecv" field to struct sock,
> and skipping the sk->sk_prot->get_port call in inet_bind if it is set.
> Then all incoming requests would arrive on the one udp socket (there
> is very little contention between incoming packets on the one socket),
> and reply could go out one of the sk_norecv ports.
>
> Does this make sense? Would you be willing to try it?
In the last week I've looked at three options for bypassing the current
send limitation.
1. Create multiple connected UDP sockets, and treat them just
like connected TCP sockets, i.e. manage them in sv_tempsocks.
2. Create a small pool of send-only sockets, as you describe.
3. Use a single socket but reduce the time the svsk->sk_sem is
held by building the datagram without the sem held and adding
it to the write queue atomically.
It seems #2 is the easiest to implement. #1 is next easiest but
would require either new hashing code in net/ipv4/udp.c or a logic
change to udp_v4_get_port(), neither of which appeals. #3 is arguably
the right thing to do but is a very large chunk of work.
So I'm pursuing #2. I think it can be done without touching the
UDP code. More info when the patch gets some testing.
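For what it's worth, the pool itself will probably end up being
something of this shape (all names invented, nothing tested yet):

	/* a pool of send-only sockets shared by the nfsd threads */
	struct svc_send_pool {
		spinlock_t	 sp_lock;
		struct list_head sp_free;	/* idle send sockets */
	};

	struct svc_send_sock {
		struct list_head ss_list;
		struct socket	*ss_sock;
	};

	static struct svc_send_sock *
	svc_get_send_sock(struct svc_send_pool *pool)
	{
		struct svc_send_sock *ss = NULL;

		spin_lock_bh(&pool->sp_lock);
		if (!list_empty(&pool->sp_free)) {
			ss = list_entry(pool->sp_free.next,
					struct svc_send_sock, ss_list);
			list_del(&ss->ss_list);
		}
		spin_unlock_bh(&pool->sp_lock);
		return ss;	/* caller falls back to the shared socket if NULL */
	}

	static void svc_put_send_sock(struct svc_send_pool *pool,
				      struct svc_send_sock *ss)
	{
		spin_lock_bh(&pool->sp_lock);
		list_add(&ss->ss_list, &pool->sp_free);
		spin_unlock_bh(&pool->sp_lock);
	}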
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.