2002-03-26 19:34:45

by Andrew Theurer

Subject: more SMP issues

Hello,

I am still seeing some scaling problems with my SMP NFS server, so I ran some
more tests. I had suspected that not enough nfsd threads were actually
doing work, since my run queue was so short (2 or less) and CPU utilization
was below 35% (this is a 4-way SMP system). So, before the tests, I added
nfsd_busy to /proc/net/rpc/nfsd, so that during the tests I could monitor
exactly how many nfsd threads were reaching svc_process() at any point in
time.

My first test is an NFS client read test: 48 clients each read a 200MB file
from the same server. The elapsed time for all the clients to finish is
recorded, and the throughput is calculated from it. Results from this test
raised concerns because I did not see a significant improvement from
uniprocessor to 4-way. Running this test over udp, this is a typical result:

Test     SMP  CPU  proto  vers  rwsize  nfsdcount  nfsdbusy  secs  MB/sec  NFSops/sec
nfsread  4P   34   udp    3     8k      128        1         109   88      11009

nfsd_busy never exceeded 1 in this test. I have tried various numbers of
nfsd threads, and there is always just one busy thread, with one exception:
if I set the number of nfsd threads to 2, nfsd_busy stays at 2 for 75% of the
time during the test, CPU utilization is about 55%, and I get maybe 15-20%
better throughput. With any other nfsd thread count, nfsd_busy does not
exceed 1.

I decided to try the tcp protocol just to get a comparison, and surprisingly
it _could_ have more than 1 for nfsd_busy:

Test     SMP  CPU  proto  vers  rwsize  nfsdcount  nfsdbusy  secs  MB/sec  NFSops/sec
nfsread  4P   100  tcp    3     8k      128        12        110   87      10903

nfsd_busy reached a maximum of 12 and probably averaged around 8 during
the test. The performance was no better, but I still don't understand why
there can be more than one busy nfsd thread with tcp and not with udp. I
know that having nfsd_busy at 2 with udp improves performance. If we can get
udp to consistently use more than one thread at a time, I think we can
boost performance quite a bit.

Now, I was concerned that I might have some sort of network throughput
limitation here, since I was close to 100MB/sec, and in my experience, even
with multiple Gbps adapters (I have 4), the memory copies involved tend to
saturate the poor 100 MHz memory bus this server has. Since I was not sure
whether that was a factor, I set up another test. This time all 48 clients
do an "ls -lR" on a 2.5.7 kernel tree on the server. This generates a
very high number of nfs requests with relatively low network throughput
compared to the read test. Total network throughput was under 4 MB/sec
(send and recv). Here are the results:

Test   SMP  CPU  proto  vers  rwsize  nfsdcount  nfsdbusy  secs  MB/sec  NFSops/sec
nfsls  4P   100  tcp    3     8k      128        11        127   -       30538
nfsls  4P   34   udp    3     8k      128        1         112   -       34750

Again, with udp I cannot get nfsd_busy to exceed 1; tcp shows the same
behavior as in the last test. My first thought was that the way the socket
bits (SK_BUSY, SK_DATA, SK_CONN, etc.) are set and cleared could be
causing a problem here, but I cannot confirm that. They appear to behave
the same way for tcp and udp, so that is probably not it. So, I am now stuck
here. If there is anyone interested in helping me investigate this (Neil?),
please let me know. Thanks for your help.

-Andrew

_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2002-03-27 02:57:37

by NeilBrown

Subject: Re: more SMP issues

On Tuesday March 26, [email protected] wrote:
> Again, with udp I cannot get nfsd_busy to exceed 1. TCP has the same
> behavior as the last test. My first thought was that maybe the way the bits
> are set and cleared for the socket (SK_BUSY, SK_DATA, SK_CONN, etc) could be
> causing a problem here, but I cannot confirm that. It appears they behave
> the same way for tcp and udp, so that is probably not it. So, I am now stuck
> here. If there is anyone interested in helping me investigate this (Neil?),
> please let me know. Thanks for your help.
>

It's very odd.

I'm doing some testing on 2.4.18 plus my patches, using UDP, and I'm
definitely getting more than 12 threads in use at times. But this
is a mixed load, not a read-only load.

Could you run

echo 256 > /proc/sys/sunrpc/rpc_debug ; sleep 5 ; echo 0 > /proc/sys/sunrpc/rpc_debug

while the udp reading is happening and see whether lines mentioning
"no space"
appear in the log? That is about the only thing that I can think of.

The fact that you seem to get more threads if you only ask for 2 is
really odd....

NeilBrown



2002-03-27 16:50:01

by Andrew Theurer

Subject: Re: more SMP issues

> It's very odd.
>
> I'm doing some testing on 2.4.18 plus my patches, using UDP, and I'm
> definitely getting more than 12 threads being used at times. But this
> is a mixed load, not a read-only load.
>
> Could you
>
> echo 256 > /proc/sys/sunrpc/rpc_debug ; sleep 5 ; echo 0 >
> /proc/sys/sunrpc/rpc_debug
>
> while the udp reading is happening and see if lines mentioning
> "no space"
> appear in the log. That is about the only thing that I can think of.

Here is udp:

Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 served by daemon
f4cd8e00
Mar 27 08:15:52 sambaperf kernel: svc: got len=132
Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 busy, not enqueued
Mar 27 08:15:52 sambaperf kernel: svc: socket f50fcda0(inet f6ec17c0),
write_space busy=1
Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 busy, not enqueued
Mar 27 08:15:52 sambaperf kernel: svc: socket f50fcda0(inet f6ec17c0),
count=140, busy=1
Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 busy, not enqueued
Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 busy, not enqueued
Mar 27 08:15:52 sambaperf kernel: svc: service f4cf0a00, releasing skb
f4909340
Mar 27 08:15:52 sambaperf kernel: svc: socket f50fcda0 sendto([f4cd4000
8320... ], 1, 8320) = 8320

I also took one for tcp just to compare:

Mar 27 08:20:28 sambaperf kernel: svc: socket f4988cc0 busy, not enqueued
Mar 27 08:20:28 sambaperf kernel: svc: socket f4a0a360 busy, not enqueued
Mar 27 08:20:28 sambaperf kernel: svc: socket f714c360 sendto([f4e04000
8324... ], 1, 8324) = 8324
Mar 27 08:20:28 sambaperf kernel: svc: socket f4a0a360 busy, not enqueued
Mar 27 08:20:28 sambaperf kernel: svc: server f436e600 waiting for data (to =
30000)
Mar 27 08:20:28 sambaperf kernel: svc: socket f43159a0 busy, not enqueued
Mar 27 08:20:28 sambaperf kernel: svc: server f439b800 waiting for data (to =
30000)
Mar 27 08:20:28 sambaperf kernel: svc: socket f4325040 TCP data ready (svsk
f446e120)
Mar 27 08:20:28 sambaperf kernel: svc: socket f4325040 busy, not enqueued
Mar 27 08:20:29 sambaperf kernel: svc: server f502d400, socket f714c360,
inuse=1

I also set up another debug run, where I printk'd at all the SK_* bit sets
and clears, and in a couple of other areas. In this test, I only sent one
read request from one client:

pid: 0 svc: socket f5b97e40(inet f4f29960), count=140, SK_BUSY=0<6> and
setting SK_DATA in svc_udp_data_ready
pid: 0 testing SK_BUSY in svc_sock_enqueue
pid: 0 setting SK_BUSY in svc_sock_enqueue
pid: 1390 clearing SK_DATA in svc_udp_recvfrom
pid: 1390 setting SK_DATA in svc_udp_recvfrom
pid: 1390 clearing SK_BUSY in svc_sock_received
pid: 1390 testing SK_BUSY in svc_sock_enqueue
pid: 1390 setting SK_BUSY in svc_sock_enqueue
pid: 1390 calling svc_process
pid: 1390 testing SK_BUSY in svc_sock_enqueue
pid: 1390 testing SK_BUSY in svc_sock_enqueue
pid: 1387 clearing SK_DATA in svc_udp_recvfrom
pid: 1387 clearing SK_BUSY in svc_sock_received
svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space
svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space
svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space

My only concern was that SK_BUSY was set and then svc_process() was called.
I thought SK_BUSY should not be set during svc_process().

> The fact that you seem to get more threads if you only ask for 2 is
> really odd....

Yes, that really shocked me. I tried it several times. I tried 1 to 8
threads; every count but 2 had nfsd_busy at 1, while 2 threads had nfsd_busy
at 2 most of the time, and throughput was up.

I am going to set up SpecSFS for a comparison, but I like this test
because it runs so quickly and exposes a potential problem.

Thanks for your help,

-Andrew


2002-03-28 04:03:24

by NeilBrown

Subject: Re: more SMP issues

On Wednesday March 27, [email protected] wrote:
>
> Here is udp:
>
> Mar 27 08:15:52 sambaperf kernel: svc: socket f6ec17c0 served by daemon
> f4cd8e00
> Mar 27 08:15:52 sambaperf kernel: svc: got len=132
....
>
> I also took one for tcp just to compare:
>
> Mar 27 08:20:28 sambaperf kernel: svc: socket f4988cc0 busy, not enqueued
> Mar 27 08:20:28 sambaperf kernel: svc: socket f4a0a360 busy, not enqueued
> Mar 27 08:20:28 sambaperf kernel: svc: socket f714c360 sendto([f4e04000
> 8324... ], 1, 8324) = 8324
....

Nothing odd here.... thanks.

>
> I also set up another debug run, where I printk'd at all the SK_* bit sets
> and clears, and in a couple of other areas. In this test, I only sent one
> read request from one client:
>
> pid: 0 svc: socket f5b97e40(inet f4f29960), count=140, SK_BUSY=0<6> and
> setting SK_DATA in svc_udp_data_ready
> pid: 0 testing SK_BUSY in svc_sock_enqueue
> pid: 0 setting SK_BUSY in svc_sock_enqueue
> pid: 1390 clearing SK_DATA in svc_udp_recvfrom
> pid: 1390 setting SK_DATA in svc_udp_recvfrom
> pid: 1390 clearing SK_BUSY in svc_sock_received
> pid: 1390 testing SK_BUSY in svc_sock_enqueue
> pid: 1390 setting SK_BUSY in svc_sock_enqueue
> pid: 1390 calling svc_process
> pid: 1390 testing SK_BUSY in svc_sock_enqueue
> pid: 1390 testing SK_BUSY in svc_sock_enqueue
> pid: 1387 clearing SK_DATA in svc_udp_recvfrom
> pid: 1387 clearing SK_BUSY in svc_sock_received
> svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space
> svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space
> svc: socket f5b97e40(inet f4f29960), write_space SK_BUSY=0 in svc_write_space
>
> My only concern was that SK_BUSY was set, then svc_process() is called. I
> thought SK_BUSY should not be set during svc_process().

No, this is normal.
After taking a request off the socket, the process calls
svc_sock_enqueue in case there is another packet. The socket gets
attached to some process and marked busy until that process wakes up
and deals with the next packet.... but that gives me an idea....

>
> > The fact that you seem to get more threads if you only ask for 2 is
> > really odd....
>
> Yes, that really shocked me. I tried it several times. I tried 1 to 8
> threads, and all but 2 had nfsd_busy of 1, 2 threads had nfsd_busy of 2 most
> of the time, and throughput was up.

Ok, here is a theory.
All your data is in cache, right?
So when nfsd gets a read request, it hurtles along processing it and
never has to block, all the way until it is completed. As this takes
less than the scheduling interval, it never gets interrupted, and no
other nfsd thread gets to run on that CPU until this one has finished
its request.
This would give you a situation in which you never have more
concurrent nfsds than you have processors.

But you are getting even worse than that: You have only one active
thread on a quad processor machine. I blithely blame this on the
scheduler.
Maybe all the nfsd threads are initially attached to the same
processor, and for some reason they never get migrated to another
processor. Is this a question for Ingo Molnar??

Maybe when you only have 2 nfsds, the dynamics are a bit different and
one thread gets migrated and you then get dual processor performance.

Possibly we could put some scheduler call in just after the wake_up()
in svcsock.c which says to go bother some other CPU???

Does that make any sense, or sound plausible?

>
> I am going to setup SpecSFS for a comparison, but IMO I like this test
> because it runs so quickly and exposes a potential problem.
>

That's what I was running.... but I had a bug in another
under-development patch and it locked up chasing a circular list at
about 2000 Ops/Sec. (That is an unpublished number. I didn't publish
it. You cannot say that I did.)

NeilBrown




2002-03-28 04:26:14

by NeilBrown

Subject: Re: more SMP issues

On Thursday March 28, [email protected] wrote:
>
> Maybe when you only have 2 nfsds, the dynamics are a bit different and
> one thread gets migrated and you then get dual processor performance.
>
> Possibly we could put some scheduler call in just after the wake_up()
> in svcsock.c which says to go bother some other CPU???

Something that might be worth a try is to change the "list_add"
in svc_serv_enqueue to "list_add_tail".

With list_add, it just uses as many threads as it needs and leaves the
others completely idle (and so not polluting cache).
With list_add_tail, it will cycle through all the threads, and
that might cause enough turbulence to get all the processors working.

It's not the *right* solution, but it is probably worth a try.

NeilBrown


2002-03-28 04:55:09

by Andrew Theurer

Subject: Re: more SMP issues

> Ok, here is a theory.
> All your data is in cache, right?

Yes

> So when NFSD gets a read request, it hurtles along processing it and
> never has to block all the way until it is completed. As this is done
> in less than the scheduling interval, it never gets interrupted and no
> other nfsd thread gets to run on that CPU until this one has finished
> it's request.
> This would give you a situation in which you never have more
> concurrent nfsd's than you have processors.

Yes, that seems right. That actually would not be so bad; at least I could
achieve 100% CPU.

> But you are getting even worse than that: You have only one active
> thread on a quad processor machine. I blithely blame this on the
> scheduler.
> Maybe all the nfsd threads are initially attached to the same
> processor, and for some reason they never get migrated to another
> processor. Is this a question for Ingo Molnar??

I think you are on to something here. Actually, when monitoring top, I
always have exactly one CPU at 100%, while the others are at ~10%
(interrupts?). Eventually the 100% moves from one processor to another.
Sounds like the nfsd threads all wake up and stay on the same processor. I
need to look at how O(1) handles wake_up, and whether it puts all tasks on
the same runqueue, or distributes them amongst <num_cpu> runqueues.

> Maybe when you only have 2 nfsds, the dynamics are a bit different and
> one thread gets migrated and you then get dual processor performance.
>
> Possibly we could put some scheduler call in just after the wake_up()
> in svcsock.c which says to go bother some other CPU???
>
> Does that make any sense, or sound plausable?

Yes. I could also retest with 4 nfsd threads, each bound to a unique CPU.
I have not looked at Ingo's latest cpu affinity work on O(1), but I'm sure
it can be done. At some point I wanted to do some experimentation with
process and IRQ affinity anyway; with samba/netbench I got 25% better
performance that way. In any case, I'll try this and list_add_tail in
svc_serv_enqueue.

> > I am going to setup SpecSFS for a comparison, but IMO I like this test
> > because it runs so quickly and exposes a potential problem.
> >
>
> That's what I was running.... but I had a bug in another
> under-development patch and it locked up chasing a circular list at
> about 2000 Ops/Sec (That is an unpublished number. I didn't publish
> it. You cannot say that I did).

I just got this started today, and I'm hitting a kernel panic at 1000 Ops in
2.5.7, and I'm not even out of the init phase. Soon I'll have some more
data. So far I did see nfsd_busy get to 12! However, my runqueue
length was still 2 or less. This obviously involved disk IO (the disk was
constantly going). I need a little more time to get SFS running well and
figure out what's going on.

Thanks,

-Andrew


