From: Wenji Wu <[email protected]>
Greetings,
For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During that period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed. Since Linux 2.6
can be interrupted mid-task, if the network application's timeslice expires
and it is moved to the expired array with the socket locked, the packets in
the backlog queue will not be TCP-processed until the application resumes
execution. If the system is heavily loaded, TCP can easily RTO on the
sender side.
Attached is the Changelog for the patch
best regards,
wenji
Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): [email protected]
(O): 001-630-840-4541
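For readers without the source at hand, the mechanism described above looks
roughly like this in 2.6-era code (a simplified, paraphrased sketch of
tcp_v4_rcv() and release_sock(); the prequeue path and error handling are
omitted, and the helper name tcp_rcv_sketch is invented here):

	/* Sketch: softirq input path (cf. 2.6 tcp_v4_rcv()).  If a
	 * user-space reader owns the socket lock, the packet is parked
	 * on the backlog instead of being TCP-processed immediately. */
	static int tcp_rcv_sketch(struct sock *sk, struct sk_buff *skb)
	{
		int ret = 0;

		bh_lock_sock(sk);
		if (!sock_owned_by_user(sk))
			ret = tcp_v4_do_rcv(sk, skb);	/* process now */
		else
			sk_add_backlog(sk, skb);	/* defer */
		bh_unlock_sock(sk);
		return ret;
	}

	/* Sketch: the backlog is drained only when the lock owner lets
	 * go (cf. 2.6 release_sock() in net/core/sock.c). */
	void release_sock(struct sock *sk)
	{
		spin_lock_bh(&sk->sk_lock.slock);
		if (sk->sk_backlog.tail)
			__release_sock(sk);	/* sk_backlog_rcv() per skb */
		sk->sk_lock.owner = NULL;
		if (waitqueue_active(&sk->sk_lock.wq))
			wake_up(&sk->sk_lock.wq);
		spin_unlock_bh(&sk->sk_lock.slock);
	}

release_sock() is where the deferred TCP processing - including sending
ACKs - finally happens, which is why a preempted reader sitting in the
expired array stalls the whole connection.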
From: Wenji Wu <[email protected]>
Attached is patch 1/4.
wenji
From: Wenji Wu <[email protected]>
Attached is patch 2/4.
wenji
From: Wenji Wu <[email protected]>
Attached is patch 3/4.
wenji
From: Wenji Wu <[email protected]>
Attached is patch 4/4.
wenji
Wenji Wu wrote:
> From: Wenji Wu <[email protected]>
>
> [...]
So how much difference did this patch actually make, and to what
benchmark?
> The patch is for Linux kernel 2.6.14 Desktop and Low-latency Desktop
The patch doesn't seem to be attached? Also, it would be better to make
it against the latest kernel version (2.6.19) ... 2.6.14 is rather old ;-)
M
Please, it is very difficult to review your work the way you have
submitted this patch as a set of 4 patches. These patches have not
been split up "logically", but rather they have been split up "per
file" with the same exact changelog message in each patch posting.
This is very clumsy, and impossible to review, and wastes a lot of
mailing list bandwidth.
We have an excellent file, called Documentation/SubmittingPatches, in
the kernel source tree, which explains exactly how to do this
correctly.
By splitting your patch into 4 patches, one for each file touched,
it is impossible to review your patch as a logical whole.
Please also provide your patch inline so people can just hit reply
in their mail reader client to quote your patch and comment on it.
This is impossible with the attachments you've used.
Thanks.
On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
David Miller <[email protected]> wrote:
> [...]
Here you go - joined up, cleaned up, ported to mainline and test-compiled.
That yield() will need to be removed - yield()'s behaviour is truly awful
if the system is otherwise busy. What is it there for?
From: Wenji Wu <[email protected]>
For Linux TCP, when a network application makes a system call to move data
from a socket's receive buffer to user space by calling tcp_recvmsg(), the
socket is locked. During this period, all incoming packets for the TCP
socket go to the backlog queue without being TCP-processed.
Since Linux 2.6 can be interrupted mid-task, if the network application's
timeslice expires and it is moved to the expired array with the socket
locked, the packets in the backlog queue will not be TCP-processed until
the application resumes execution. If the system is heavily loaded, TCP can
easily RTO on the sender side.
include/linux/sched.h | 2 ++
kernel/fork.c | 3 +++
kernel/sched.c | 24 ++++++++++++++++++------
net/ipv4/tcp.c | 9 +++++++++
4 files changed, 32 insertions(+), 6 deletions(-)
diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~tcp-speedup
+++ a/net/ipv4/tcp.c
@@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
struct task_struct *user_recv = NULL;
int copied_early = 0;
+ current->backlog_flag = 1;
+
lock_sock(sk);
TCP_CHECK_TIMER(sk);
@@ -1468,6 +1470,13 @@ skip_copy:
TCP_CHECK_TIMER(sk);
release_sock(sk);
+
+ current->backlog_flag = 0;
+ if (current->extrarun_flag == 1){
+ current->extrarun_flag = 0;
+ yield();
+ }
+
return copied;
out:
diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
--- a/include/linux/sched.h~tcp-speedup
+++ a/include/linux/sched.h
@@ -1023,6 +1023,8 @@ struct task_struct {
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
+ int backlog_flag; /* packets wait in tcp backlog queue flag */
+ int extrarun_flag; /* extra run flag for TCP performance */
};
static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
--- a/kernel/sched.c~tcp-speedup
+++ a/kernel/sched.c
@@ -3099,12 +3099,24 @@ void scheduler_tick(void)
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
- enqueue_task(p, rq->expired);
- if (p->static_prio < rq->best_expired_prio)
- rq->best_expired_prio = p->static_prio;
- } else
- enqueue_task(p, rq->active);
+ if (p->backlog_flag == 0) {
+ if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+ enqueue_task(p, rq->expired);
+ if (p->static_prio < rq->best_expired_prio)
+ rq->best_expired_prio = p->static_prio;
+ } else
+ enqueue_task(p, rq->active);
+ } else {
+ if (expired_starving(rq)) {
+ enqueue_task(p,rq->expired);
+ if (p->static_prio < rq->best_expired_prio)
+ rq->best_expired_prio = p->static_prio;
+ } else {
+ if (!TASK_INTERACTIVE(p))
+ p->extrarun_flag = 1;
+ enqueue_task(p,rq->active);
+ }
+ }
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
--- a/kernel/fork.c~tcp-speedup
+++ a/kernel/fork.c
@@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
clear_tsk_thread_flag(p, TIF_SIGPENDING);
init_sigpending(&p->pending);
+ p->backlog_flag = 0;
+ p->extrarun_flag = 0;
+
p->utime = cputime_zero;
p->stime = cputime_zero;
p->sched_time = 0;
_
From: Andrew Morton <[email protected]>
Date: Wed, 29 Nov 2006 17:08:35 -0800
> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <[email protected]> wrote:
>
> > [...]
>
> Here you go - joined up, cleaned up, ported to mainline and test-compiled.
>
> That yield() will need to be removed - yield()'s behaviour is truly awful
> if the system is otherwise busy. What is it there for?
What about simply turning off CONFIG_PREEMPT to fix this "problem"?
We always properly run the backlog (by doing a release_sock()) before
going to sleep otherwise except for the specific case of taking a page
fault during the copy to userspace. It is only CONFIG_PREEMPT that
can cause this situation to occur in other circumstances as far as I
can see.
We could also pepper tcp_recvmsg() with some very carefully placed
preemption disable/enable calls to deal with this even with
CONFIG_PREEMPT enabled.
Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 Desktop and Low-latency Desktop" in the uploaded paper. This "problem" happens with the 2.6 Desktop and Low-latency Desktop configurations.
>We could also pepper tcp_recvmsg() with some very carefully placed preemption disable/enable calls to deal with this even with CONFIG_PREEMPT enabled.
I also thought about this approach. But since the "problem" happens on the 2.6 Desktop and Low-latency Desktop (not the server), where system responsiveness is a key feature, simply placing preemption disable/enable calls might not work. If you want to place preemption disable/enable calls within tcp_recvmsg(), you have to put them at the very beginning and end of the call. Disabling preemption for that long would degrade system responsiveness.
wenji
----- Original Message -----
From: David Miller <[email protected]>
Date: Wednesday, November 29, 2006 7:13 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
> [...]
> That yield() will need to be removed - yield()'s behaviour is truly
> awful if the system is otherwise busy. What is it there for?
Please read the uploaded paper, which has a detailed description.
thanks,
wenji
----- Original Message -----
From: Andrew Morton <[email protected]>
Date: Wednesday, November 29, 2006 7:08 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
> [...]
From: Wenji Wu <[email protected]>
Date: Wed, 29 Nov 2006 19:56:58 -0600
> >We could also pepper tcp_recvmsg() with some very carefully placed
> >preemption disable/enable calls to deal with this even with
> >CONFIG_PREEMPT enabled.
>
> I also thought about this approach. But since the "problem" happens on
> the 2.6 Desktop and Low-latency Desktop (not the server), where system
> responsiveness is a key feature, simply placing preemption
> disable/enable calls might not work. If you want to place
> preemption disable/enable calls within tcp_recvmsg(), you have to put
> them at the very beginning and end of the call. Disabling preemption
> for that long would degrade system responsiveness.
We can make explicit preemption checks in the main loop of
tcp_recvmsg(), and release the socket and run the backlog if
need_resched() is TRUE.
This is the simplest and most elegant solution to this problem.
The ones suggested in your patch and paper are way overkill; there is
no reason to solve a TCP-specific problem inside of the generic
scheduler.
On Wed, 2006-11-29 at 17:08 -0800, Andrew Morton wrote:
> + if (p->backlog_flag == 0) {
> + if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> + enqueue_task(p, rq->expired);
> + if (p->static_prio < rq->best_expired_prio)
> + rq->best_expired_prio = p->static_prio;
> + } else
> + enqueue_task(p, rq->active);
> + } else {
> + if (expired_starving(rq)) {
> + enqueue_task(p,rq->expired);
> + if (p->static_prio < rq->best_expired_prio)
> + rq->best_expired_prio = p->static_prio;
> + } else {
> + if (!TASK_INTERACTIVE(p))
> + p->extrarun_flag = 1;
> + enqueue_task(p,rq->active);
> + }
> + }
(oh my, doing that to the scheduler upsets my tummy, but that aside...)
I don't see how that can really solve anything. "Interactive" tasks
starting to use cpu heftily can still preempt and keep the special cased
cpu hog off the cpu for ages. It also only takes one task in the
expired array to trigger the forced array switch with a fully loaded
cpu, and once any task hits the expired array, a stream of wakeups can
prevent the switch from completing for as long as you can keep wakeups
happening.
-Mike
* Wenji Wu <[email protected]> wrote:
> > That yield() will need to be removed - yield()'s behaviour is truly
> > awful if the system is otherwise busy. What is it there for?
>
> Please read the uploaded paper, which has a detailed description.
do you have any URL for that?
Ingo
* David Miller <[email protected]> wrote:
> We can make explicit preemption checks in the main loop of
> tcp_recvmsg(), and release the socket and run the backlog if
> need_resched() is TRUE.
>
> This is the simplest and most elegant solution to this problem.
yeah, i like this one. If the problem is "too long locked section", then
the most natural solution is to "break up the lock", not to "boost the
priority of the lock-holding task" (which is what the proposed patch
does).
[ Also note that "sprinkle the code with preempt_disable()" kind of
solutions, besides hurting interactivity, are also a pain to resolve
in something like PREEMPT_RT. (unlike say a spinlock,
preempt_disable() is quite opaque in what data structure it protects,
etc., making it hard to convert it to a preemptible primitive) ]
> The one suggested in your patch and paper are way overkill, there is
> no reason to solve a TCP specific problem inside of the generic
> scheduler.
agreed.
What we could also add is a /reverse/ mechanism to the scheduler: a task
could query whether it has just a small amount of time left in its
timeslice, and could in that case voluntarily drop its current lock and
yield, and thus give up its current timeslice and wait for a new, full
timeslice, instead of being forcibly preempted due to lack of timeslices
with a possibly critical lock still held.
But the suggested solution here, to "prolong the running of this task
just a little bit longer" only starts a perpetual arms race between
users of such a facility and other kernel subsystems. (besides not being
adequate anyway, there can always be /so/ long lock-hold times that the
scheduler would have no other option but to preempt the task)
Ingo
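A hypothetical sketch of that "reverse" mechanism follows. Nothing like
task_timeslice_nearly_expired() exists in the kernel; the name and the
usage site are invented here, and current->time_slice is the 2.6 O(1)
scheduler's remaining-quantum counter:

	/* Hypothetical only: let a lock holder ask the scheduler
	 * whether its quantum is nearly gone, so it can drop the lock
	 * and give up the timeslice voluntarily instead of being
	 * preempted with the lock held. */
	static inline int task_timeslice_nearly_expired(void)
	{
		return current->time_slice <= 1;	/* ticks left */
	}

	/* Possible use inside tcp_recvmsg()'s main loop (sketch): */
	if (task_timeslice_nearly_expired()) {
		release_sock(sk);	/* drains backlog, lets ACKs out */
		yield();		/* wait for a new, full timeslice */
		lock_sock(sk);
	}

As Andrew notes above, the yield() is the contentious part of any such
scheme when the system is busy.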
From: Ingo Molnar <[email protected]>
Date: Thu, 30 Nov 2006 07:17:58 +0100
>
> * David Miller <[email protected]> wrote:
>
> > We can make explicit preemption checks in the main loop of
> > tcp_recvmsg(), and release the socket and run the backlog if
> > need_resched() is TRUE.
> >
> > This is the simplest and most elegant solution to this problem.
>
> yeah, i like this one. If the problem is "too long locked section", then
> the most natural solution is to "break up the lock", not to "boost the
> priority of the lock-holding task" (which is what the proposed patch
> does).
Ingo, you've misread the problem :-)
The issue is that we actually don't hold any locks that prevent
preemption, so we can take preemption points which the TCP code
wasn't designed with in-mind.
Normally, we control the sleep point very carefully in the TCP
sendmsg/recvmsg code, such that when we sleep we drop the socket
lock and process the backlog packets that accumulated while the
socket was locked.
With pre-emption we can't control that properly.
The problem is that we really do need to run the backlog any time
we give up the cpu in the sendmsg/recvmsg path, or things get real
erratic. ACKs don't go out as early as we'd like them to, etc.
It isn't easy to do generically, perhaps, because we can only
drop the socket lock at certain points and we need to do that to
run the backlog.
This is why my suggestion is to preempt_disable() as soon as we
grab the socket lock, and explicitly test need_resched() at places
where it is absolutely safe, like this:
	if (need_resched()) {
		/* Run packet backlog... */
		release_sock(sk);
		schedule();
		lock_sock(sk);
	}
The socket lock is just a by-hand binary semaphore, so it doesn't
block pre-emption. We have to be able to sleep while holding it.
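Putting David's two fragments together, a sketch of the full proposal
(hypothetical placement; "target" and the loop condition stand in for the
real exit logic of tcp_recvmsg(), which has many more paths):

	/* Sketch of the proposal, not actual kernel code: take the
	 * socket lock first (it may sleep), then hold off involuntary
	 * preemption and poll need_resched() only where the lock can
	 * safely be dropped - so the backlog (and its ACKs) is always
	 * run before we give up the CPU. */
	lock_sock(sk);
	preempt_disable();

	while (copied < target /* plus the usual exit conditions */) {
		/* ... copy queued segments to user space ... */

		if (need_resched()) {
			/* Run packet backlog... */
			release_sock(sk);
			preempt_enable_no_resched();
			schedule();
			lock_sock(sk);
			preempt_disable();
		}
	}

	release_sock(sk);	/* final backlog drain */
	preempt_enable();

Note the ordering: lock_sock() can sleep, so it has to be taken before
preemption is disabled, and schedule() must not run with the preempt count
raised. As noted earlier in the thread, a page fault during the user-space
copy is the one sleep this scheme cannot reorder around.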
* David Miller <[email protected]> wrote:
> > yeah, i like this one. If the problem is "too long locked section",
> > then the most natural solution is to "break up the lock", not to
> > "boost the priority of the lock-holding task" (which is what the
> > proposed patch does).
>
> Ingo, you've misread the problem :-)
yeah, the problem isnt too long locked section but "too much time spent
holding a lock" and hence opening up ourselves to possible negative
side-effects of the scheduler's fairness algorithm when it forces a
preemption of that process context with that lock held (and forcing all
subsequent packets to be backlogged).
but please read my last mail - i think i'm slowly starting to wake up
;-) I dont think there is any real problem: a tweak to the scheduler
that in essence gives TCP-using tasks a preference changes the balance
of workloads. Such an explicit tweak is possible already.
furthermore, the tweak allows the shifting of processing from a
prioritized process context into a highest-priority softirq context.
(it's not proven that there is any significant /net win/ of performance:
all that was proven is that if we shift TCP processing from process
context into softirq context then TCP throughput of that otherwise
penalized process context increases.)
Ingo
* David Miller <[email protected]> wrote:
> This is why my suggestion is to preempt_disable() as soon as we grab
> the socket lock, [...]
independently of the issue at hand, in general the explicit use of
preempt_disable() in non-infrastructure code is quite a heavy tool. Its
effects are heavy and global: it disables /all/ preemption (even on
PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU
data structures then [unlike for example to a spin-lock] the connection
between the 'data' and the 'lock' is not explicit - causing all kinds of
grief when trying to convert such code to a different preemption model.
(such as PREEMPT_RT :-)
So my plan is to remove all "open-coded" use of preempt_disable() [and
raw use of local_irq_save/restore] from the kernel and replace it with
some facility that connects data and lock. (Note that this will not
result in any actual changes on the instruction level because internally
every such facility still maps to preempt_disable() on non-PREEMPT_RT
kernels, so on non-PREEMPT_RT kernels such code will still be the same
as before.)
Ingo
From: Ingo Molnar <[email protected]>
Date: Thu, 30 Nov 2006 07:47:58 +0100
> furthermore, the tweak allows the shifting of processing from a
> prioritized process context into a highest-priority softirq context.
> (it's not proven that there is any significant /net win/ of performance:
> all that was proven is that if we shift TCP processing from process
> context into softirq context then TCP throughput of that otherwise
> penalized process context increases.)
If we preempt with any packets in the backlog, we send no ACKs and the
sender cannot send, thus the pipe empties. That's the problem; this
has nothing to do with scheduler priorities or stuff like that IMHO.
The argument goes that if the reschedule is delayed long enough, the
ACK delay will exceed the round trip time and trigger retransmits which
will absolutely kill performance.
The only reason we block input packet processing while we hold this
lock is because we don't want the receive queue changing from
underneath us while we're copying data to userspace.
Furthermore once you preempt in this particular way, no input
packet processing occurs in that socket still, exacerbating the
situation.
Anyways, even if we somehow unlocked the socket and ran the backlog at
preemption points, by hand, since we've thus deferred the whole work
of processing whatever is in the backlog until the preemption point,
we've lost our quantum already, so it's perhaps not legal to do the
deferred processing as the preemption signalling point from a fairness
perspective.
It would be different if we really did the packet processing at the
original moment (where we had to queue to the socket backlog because
it was locked, in softirq) because then we'd return from the softirq
and hit the preemption point earlier or whatever.
Therefore, perhaps the best would be to see if there is a way we can
still allow input packet processing even while running the majority of
TCP's recvmsg(). It won't be easy :)
* David Miller <[email protected]> wrote:
> > furthermore, the tweak allows the shifting of processing from a
> > prioritized process context into a highest-priority softirq context.
> > (it's not proven that there is any significant /net win/ of
> > performance: all that was proven is that if we shift TCP processing
> > from process context into softirq context then TCP throughput of
> > that otherwise penalized process context increases.)
>
> If we preempt with any packets in the backlog, we send no ACKs and the
> sender cannot send thus the pipe empties. That's the problem, this
> has nothing to do with scheduler priorities or stuff like that IMHO.
> The argument goes that if the reschedule is delayed long enough, the
> ACKs will exceed the round trip time and trigger retransmits which
> will absolutely kill performance.
yes, but i disagree a bit about the characterisation of the problem. The
question in my opinion is: how is TCP processing prioritized for this
particular socket, which is attached to the process context which was
preempted.
normally, quite a bit of TCP processing happens in a softirq
context (in fact most of it happens there), and softirq contexts have no
fairness whatsoever - they preempt whatever processing is going on,
regardless of any priority preferences of the user!
what was observed here were the effects of completely throttling TCP
processing for a given socket. I think such throttling can in fact be
desirable: there is a /reason/ why the process context was preempted: in
that load scenario there was 10 times more processing requested from the
CPU than it can possibly service. It's a serious overload situation and
it's the scheduler's task to prioritize between workloads!
normally such kind of "throttling" of the TCP stack for this particular
socket does not happen. Note that there's no performance lost: we dont
do TCP processing because there are /9 other tasks for this CPU to run/,
and the scheduler has a tough choice.
Now i agree that there are more intelligent ways to throttle and less
intelligent ways to throttle, but the notion to allow a given workload
'steal' CPU time from other workloads by allowing it to push its
processing into a softirq is i think unfair. (and this issue is
partially addressed by my softirq threading patches in -rt :-)
Ingo
On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
> Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 Desktop and Low-latency Desktop" in the uploaded paper. This "problem" happens with the 2.6 Desktop and Low-latency Desktop configurations.
CONFIG_PREEMPT is only for people that are in it for the feeling. There is no
real world advantage to it and we should probably remove it again.
On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([email protected]) wrote:
> what was observed here were the effects of completely throttling TCP
> processing for a given socket. I think such throttling can in fact be
> desirable: there is a /reason/ why the process context was preempted: in
> that load scenario there was 10 times more processing requested from the
> CPU than it can possibly service. It's a serious overload situation and
> it's the scheduler's task to prioritize between workloads!
>
> normally such kind of "throttling" of the TCP stack for this particular
> socket does not happen. Note that there's no performance lost: we dont
> do TCP processing because there are /9 other tasks for this CPU to run/,
> and the scheduler has a tough choice.
>
> Now i agree that there are more intelligent ways to throttle and less
> intelligent ways to throttle, but the notion to allow a given workload
> 'steal' CPU time from other workloads by allowing it to push its
> processing into a softirq is i think unfair. (and this issue is
> partially addressed by my softirq threading patches in -rt :-)
Isn't the provided solution just an in-kernel variant of the
SCHED_FIFO set from userspace? Why should the kernel be able to mark some
users as having higher priority?
What if the workload of the system is targeted not at maximum TCP
performance but at maximum other-task performance, which the provided
patch will break?
> Ingo
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([email protected]) wrote:
> Isn't the provided solution just an in-kernel variant of the
> SCHED_FIFO set from userspace? Why should the kernel be able to mark some
> users as having higher priority?
> What if the workload of the system is targeted not at maximum TCP
> performance but at maximum other-task performance, which the provided
> patch will break?
David's line of thinking for a solution sounds better to me. This patch
does not prevent the process from being preempted (for potentially a long
time), by any means.
--
SUSE Labs, Novell Inc.
On Thu, Nov 30, 2006 at 09:07:42PM +1100, Nick Piggin ([email protected]) wrote:
> >Isn't the provided solution just an in-kernel variant of the
> >SCHED_FIFO set from userspace? Why should the kernel be able to mark some
> >users as having higher priority?
> >What if the workload of the system is targeted not at maximum TCP
> >performance but at maximum other-task performance, which the provided
> >patch will break?
>
> David's line of thinking for a solution sounds better to me. This patch
> does not prevent the process from being preempted (for potentially a long
> time), by any means.
It steals timeslices from other processes to complete the tcp_recvmsg()
task, and only when it does so for too long will it be preempted.
Processing the backlog queue on behalf of need_resched() will break fairness
too - the processing itself can take a lot of time, so the process can be
scheduled away in that part too.
> --
> SUSE Labs, Novell Inc.
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > David's line of thinking for a solution sounds better to me. This
> > patch does not prevent the process from being preempted (for
> > potentially a long time), by any means.
>
> It steals timeslices from other processes to complete the tcp_recvmsg()
> task, and only when it does so for too long will it be preempted.
> Processing the backlog queue on behalf of need_resched() will break
> fairness too - the processing itself can take a lot of time, so the
> process can be scheduled away in that part too.
correct - it's just the wrong thing to do. The '10% performance win'
that was measured was against _9 other tasks who contended for the same
CPU resource_. I.e. it's /not/ an absolute 'performance win' AFAICS,
it's a simple shift in CPU cycles away from the other 9 tasks and
towards the task that does TCP receive.
Note that even without the change the TCP receiving task is already
getting a disproportionate share of cycles due to softirq processing!
Under a load of 10.0 it went from 500 mbits to 74 mbits, while the
'fair' share would be 50 mbits. So the TCP receiver /already/ has an
unfair advantage. The patch only deepens that unfairness.
The solution is really simple and needs no kernel change at all: if you
want the TCP receiver to get a larger share of timeslices then either
renice it to -20 or renice the other tasks to +19.
The other disadvantage, even ignoring that it's the wrong thing to do,
is the crudeness of preempt_disable() that i mentioned in the other
post:
---------->
independently of the issue at hand, in general the explicit use of
preempt_disable() in non-infrastructure code is quite a heavy tool. Its
effects are heavy and global: it disables /all/ preemption (even on
PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU
data structures then [unlike, for example, a spin-lock] the connection
between the 'data' and the 'lock' is not explicit - causing all kinds of
grief when trying to convert such code to a different preemption model.
(such as PREEMPT_RT :-)
So my plan is to remove all "open-coded" use of preempt_disable() [and
raw use of local_irq_save/restore] from the kernel and replace it with
some facility that connects data and lock. (Note that this will not
result in any actual changes on the instruction level because internally
every such facility still maps to preempt_disable() on non-PREEMPT_RT
kernels, so on non-PREEMPT_RT kernels such code will still be the same
as before.)
Ingo
>We can make explicit preemption checks in the main loop of
>tcp_recvmsg(), and release the socket and run the backlog if
>need_resched() is TRUE.
>This is the simplest and most elegant solution to this problem.
I am not sure whether this approach will work. How can you make the explicit
preemption checks?
For the Desktop case, yes, you can make explicit preemption checks at some
points to see whether need_resched() is true. But when need_resched() is
true, you cannot decide whether it was triggered by a higher-priority
process becoming runnable, or by the process within tcp_recvmsg() expiring.
If a higher-priority process becomes runnable (e.g., an interactive
process), you had better yield the CPU instead of continuing this process. If
it is the case that the process within tcp_recvmsg() is expiring, then you
can let the process go ahead and process the backlog.
For the Low-latency Desktop case, I believe it is very hard to make the
checks. We do not know when the process is going to expire, or when a
higher-priority process will become runnable. The process could expire at
any moment, or a higher-priority process could become runnable at any
moment. If we do not want to trade off system responsiveness, where do you
want to make the check? And if you do make the check and need_resched()
turns out TRUE, what are you going to do in this case?
wenji
On Thu, 2006-11-30 at 09:33 +0000, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
> > Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why I put "for 2.6 Desktop and Low-latency Desktop" in the uploaded paper. This "problem" happens with the 2.6 Desktop and Low-latency Desktop configurations.
>
> CONFIG_PREEMPT is only for people that are in it for the feeling. There is no
> real world advantage to it and we should probably remove it again.
There certainly is a real world advantage for many applications. Of
course it would be better if the latency requirements could be met
without kernel preemption but that's not the case now.
Lee
>The solution is really simple and needs no kernel change at all: if you
>want the TCP receiver to get a larger share of timeslices then either
>renice it to -20 or renice the other tasks to +19.
Simply giving a larger share of timeslices to the TCP receiver won't solve the
problem. No matter what the timeslice is, if the TCP receiving process has
packets in its backlog and the process expires and is moved to the expired
array, an RTO might happen at the TCP sender.
The solution does not look that simple.
wenji
From: Wenji Wu <[email protected]>
Date: Thu, 30 Nov 2006 10:08:22 -0600
> If a higher-priority process becomes runnable (e.g., an interactive
> process), you had better yield the CPU instead of continuing this process. If
> it is the case that the process within tcp_recvmsg() is expiring, then you
> can let the process go ahead and process the backlog.
Yes, I understand this, and I made that point in one of my
replies to Ingo Molnar last night.
The only seemingly remaining possibility is to find a way to allow
input packet processing, at least enough to emit ACKs, during
tcp_recvmsg() processing.
From: Evgeniy Polyakov <[email protected]>
Date: Thu, 30 Nov 2006 13:22:06 +0300
> It steals timeslices from other processes to complete the tcp_recvmsg()
> task, and only when it does so for too long will it be preempted.
> Processing the backlog queue on behalf of need_resched() will break
> fairness too - the processing itself can take a lot of time, so the
> process can be scheduled away in that part too.
Yes, at this point I agree with this analysis.
Currently I am therefore advocating some way to allow
full input packet handling even amidst tcp_recvmsg()
processing.
* Wenji Wu <[email protected]> wrote:
> >The solution is really simple and needs no kernel change at all: if
> >you want the TCP receiver to get a larger share of timeslices then
> >either renice it to -20 or renice the other tasks to +19.
>
> Simply giving a larger share of timeslices to the TCP receiver won't
> solve the problem. No matter what the timeslice is, if the TCP
> receiving process has packets in its backlog and the process expires
> and is moved to the expired array, an RTO might happen at the TCP
> sender.
if you still have the test-setup, could you nevertheless try setting the
priority of the receiving TCP task to nice -20 and see what kind of
performance you get?
Ingo
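For reference, the experiment Ingo is asking for needs no kernel change at
all; from user space it is just the following (a minimal runnable sketch -
the pid of the TCP-receiving test program is passed on the command line):

	/* Equivalent of `renice -20 -p <pid>`: give the TCP-receiving
	 * process the strongest static-priority bias available. */
	#include <sys/types.h>
	#include <sys/time.h>
	#include <sys/resource.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		pid_t pid;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <pid>\n", argv[0]);
			return 1;
		}
		pid = (pid_t)atoi(argv[1]);

		if (setpriority(PRIO_PROCESS, pid, -20) < 0) {
			perror("setpriority");
			return 1;
		}
		return 0;
	}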
From: Ingo Molnar <[email protected]>
Date: Thu, 30 Nov 2006 11:32:40 +0100
> Note that even without the change the TCP receiving task is already
> getting a disproportionate share of cycles due to softirq processing!
> Under a load of 10.0 it went from 500 mbits to 74 mbits, while the
> 'fair' share would be 50 mbits. So the TCP receiver /already/ has an
> unfair advantage. The patch only deepens that unfairness.
I want to point out something which is slightly misleading about this
kind of analysis.
Your disk I/O speed doesn't go down by a factor of 10 just because 9
other non disk I/O tasks are running, yet for TCP that's seemingly OK
:-)
Not looking at input TCP packets enough to send out the ACKs is the
same as "forgetting" to queue some I/O requests that can go to the
controller right now.
That's the problem, TCP performance is intimately tied to ACK
feedback. So we should find a way to make sure ACK feedback goes
out, in preference to other tcp_recvmsg() processing.
What really should pace the TCP sender in this kind of situation is
the advertised window, not the lack of ACKs. Lack of an ACK means the
packet didn't get there, which is the wrong signal in this kind of
situation, whereas a closing window means "application can't keep
up with the data rate, hold on..." and is the proper flow control
signal in this high-load scenario.
If you don't send ACKs, packets are retransmitted when there is no
reason for it, and that borders on illegal. :-)
* David Miller <[email protected]> wrote:
> I want to point out something which is slightly misleading about this
> kind of analysis.
>
> Your disk I/O speed doesn't go down by a factor of 10 just because 9
> other non disk I/O tasks are running, yet for TCP that's seemingly OK
> :-)
disk I/O is typically not CPU bound, and i believe these TCP tests /are/
CPU-bound. Otherwise there would be no expiry of the timeslice to begin
with and the TCP receiver task would always be boosted to 'interactive'
status by the scheduler and would happily chug along at 500 mbits ...
(and i grant you, if a disk IO test is 20% CPU bound in process context
and system load is 10, then the scheduler will throttle that task quite
effectively.)
Ingo
From: Ingo Molnar <[email protected]>
Date: Thu, 30 Nov 2006 21:30:26 +0100
> disk I/O is typically not CPU bound, and i believe these TCP tests /are/
> CPU-bound. Otherwise there would be no expiry of the timeslice to begin
> with and the TCP receiver task would always be boosted to 'interactive'
> status by the scheduler and would happily chug along at 500 mbits ...
It's about the prioritization of the work.
If all disk I/O were shut off and frozen while we copy file
data into userspace, you'd see the same problem for disk I/O.
> It steals timeslices from other processes to complete the tcp_recvmsg()
> task, and only when it does so for too long will it be preempted.
> Processing the backlog queue on behalf of need_resched() will break
> fairness too - the processing itself can take a lot of time, so the
> process can be scheduled away in that part too.
It does steal timeslices from other processes to complete the tcp_recvmsg()
task. But I do not think it will take long. When the backlog is processed,
the processed packets go to the receive buffer, and TCP flow control will
take effect to slow down the sender.
The data receiving process might be preempted by higher-priority processes.
As long as the data receiving process stays in the active array, the problem
is not that bad, because the process might resume its execution soon. The
worst case is that it expires and is moved to the expired array with packets
in the backlog queue.
wenji
* David Miller <[email protected]> wrote:
> > disk I/O is typically not CPU bound, and i believe these TCP tests
> > /are/ CPU-bound. Otherwise there would be no expiry of the timeslice
> > to begin with and the TCP receiver task would always be boosted to
> > 'interactive' status by the scheduler and would happily chug along
> > at 500 mbits ...
>
> It's about the prioritization of the work.
>
> If all disk I/O were shut off and frozen while we copy file data into
> userspace, you'd see the same problem for disk I/O.
well, it's an issue of how much processing is done in non-prioritized
contexts. TCP is a bit more sensitive to process context being throttled
- but disk I/O is not immune either: if nothing submits new IO, or if
the task does short reads+writes then any process level throttling
immediately shows up in IO throughput.
but in the general sense it is /unfair/ that certain processing such as
disk and network IO can get a disproportionate amount of CPU time from
the system - just because they happen to have some of their processing
in IRQ and softirq context (which is essentially prioritized to
SCHED_FIFO 100). A system can easily spend 80% CPU time in softirq
context. (and that is easily visible in something like an -rt kernel
where various softirq contexts are separate threads and you can see 30%
net-rx and 20% net-tx CPU utilization in 'top'). How is this kind of
processing different from purely process-context based subsystems?
so i agree with you that by tweaking the TCP stack to be less sensitive
to process throttling you /will/ improve the relative performance of the
TCP receiver task - but in general system design and scheduler design
terms it's not a win.
i'd also agree with the notion that the current 'throttling' of process
contexts can be abrupt and uncooperative, and hence the TCP stack could
get more out of the same amount of CPU time if it used it in a smarter
way. As i pointed it out in the first mail i'd support the TCP stack
getting the ability to query how much timeslice it has left - or even the
scheduler notifying the TCP stack via some downcall if
current->timeslice reaches 1 (or something like that).
So i dont support the scheme proposed here, the blatant bending of the
priority scale towards the TCP workload. Instead what i'd like to see is
more TCP performance (and a nicer over-the-wire behavior - no
retransmits for example) /with the same 10% CPU time used/. Are we in
rough agreement?
Ingo
From: Ingo Molnar <[email protected]>
Date: Thu, 30 Nov 2006 21:49:08 +0100
> So i dont support the scheme proposed here, the blatant bending of the
> priority scale towards the TCP workload.
I don't support this scheme either ;-)
That's why my proposal is to find a way to allow input packet
processing even during tcp_recvmsg() work. It is a solution that
would give the TCP task exactly its time slice, no more, no less,
without the erroneous behavior of sleeping with packets held in the
socket backlog.
* Ingo Molnar <[email protected]> wrote:
> [...] Instead what i'd like to see is more TCP performance (and a
> nicer over-the-wire behavior - no retransmits for example) /with the
> same 10% CPU time used/. Are we in rough agreement?
put in another way: i'd like to see the "TCP bytes transferred per CPU
time spent by the TCP stack" ratio maximized in a load-independent
way (part of which is the sender host too: to not cause unnecessary
retransmits is important as well). In a high-load scenario this means
that any measure that purely improves TCP throughput by giving it more
cycles is not a real improvement. So the focus should be on throttling
intelligently and without causing extra work on the sender side either -
not on trying to circumvent throttling measures.
Ingo
>if you still have the test-setup, could you nevertheless try setting the
>priority of the receiving TCP task to nice -20 and see what kind of
>performance you get?
A process with a nice value of -20 can easily get interactivity status. When
it expires, it still goes back to the active array. That just hides the TCP
problem instead of solving it.
A process with a nice value of -20 has the following advantages over other
processes:
(1) its timeslice is 800ms, while the timeslice of a process with a nice
value of 0 is 100ms;
(2) it has higher priority than other processes;
(3) it gains interactivity status more easily.
The chance that the process expires and moves to the expired array with
packets in the backlog is much reduced, but it still exists.
wenji
On Thu, Nov 30, 2006 at 12:14:43PM -0800, David Miller ([email protected]) wrote:
> > It steals timeslices from other processes to complete the tcp_recvmsg()
> > task, and only when it does so for too long will it be preempted.
> > Processing the backlog queue on behalf of need_resched() will break
> > fairness too - the processing itself can take a lot of time, so the
> > process can be scheduled away in that part too.
>
> Yes, at this point I agree with this analysis.
>
> Currently I am therefore advocating some way to allow
> full input packet handling even amidst tcp_recvmsg()
> processing.
Isn't it a step in the direction of full TCP processing bound to process
context? :)
--
Evgeniy Polyakov
From: Evgeniy Polyakov <[email protected]>
Date: Fri, 1 Dec 2006 12:53:07 +0300
> Isn't it a step in the direction of full TCP processing bound to process
> context? :)
:-)
Rather, it is just finer grained locking.