On Mon, 2003-08-04 at 22:55, Andrew Morton wrote:
> There is a very good argument for giving !SCHED_OTHER tasks
> "special treatment" in the VM.
Yes, there is.
Attached patch is against 2.6.0-test2-mm4. It does two main things:
- Let real-time tasks dip further into the reserves than
usual in __alloc_pages(). There are a lot of ways to
special case this. This patch just cuts z->pages_low in
half, before doing the incremental min thing, for
real-time tasks. I do not do anything in the low memory
slow path. We can be a _lot_ more aggressive if we want.
Right now, we just give real-time tasks a little help.
- Never ever call balance_dirty_pages() on a real-time
task. Where and how exactly we handle this is up for
debate. We could, for example, special case real-time
tasks inside balance_dirty_pages(). This would allow
us to perform some of the work (say, waking up pdflush)
but not other work (say, the active throttling). As it
stands now, we do the per-processor accounting in
balance_dirty_pages_ratelimited() but we never call
balance_dirty_pages(). Lots of approaches work. What
we want to do is never engage the real-time task in
forced writeback.
It compiles, it boots, and it does not crash. I have not tested whether
it prevents the starvation being observed in real-time applications --
mostly because I am not sure if my approach is what you want. There are
multiple ways to handle the real-time task path; I picked one, but I do
not know if it is the right one.
Robert Love
include/linux/sched.h | 4 +++-
kernel/sched.c | 1 -
mm/page-writeback.c | 6 +++++-
mm/page_alloc.c | 29 ++++++++++++++++++++---------
4 files changed, 28 insertions(+), 12 deletions(-)
diff -urN linux-2.6.0-test2-mm4/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.0-test2-mm4/include/linux/sched.h 2003-08-05 14:53:47.000000000 -0700
+++ linux/include/linux/sched.h 2003-08-05 12:38:41.000000000 -0700
@@ -282,7 +282,9 @@
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define MAX_PRIO (MAX_RT_PRIO + 40)
-
+
+#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+
/*
* Some day this will be a full-fledged user tracking system..
*/
diff -urN linux-2.6.0-test2-mm4/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.0-test2-mm4/kernel/sched.c 2003-08-05 14:53:47.000000000 -0700
+++ linux/kernel/sched.c 2003-08-05 12:38:29.000000000 -0700
@@ -199,7 +199,6 @@
#define this_rq() (cpu_rq(smp_processor_id())) /* not __get_cpu_var(runqueues)! */
#define task_rq(p) cpu_rq(task_cpu(p))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
-#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
/*
* Default context-switch locking:
diff -urN linux-2.6.0-test2-mm4/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-2.6.0-test2-mm4/mm/page_alloc.c 2003-08-05 14:48:38.000000000 -0700
+++ linux/mm/page_alloc.c 2003-08-05 14:40:56.000000000 -0700
@@ -518,7 +518,8 @@
*
* Herein lies the mysterious "incremental min". That's the
*
- * min += z->pages_low;
+ * local_low = z->pages_low;
+ * min += local_low;
*
* thing. The intent here is to provide additional protection to low zones for
* allocation requests which _could_ use higher zones. So a GFP_HIGHMEM
@@ -536,10 +537,11 @@
unsigned long min;
struct zone **zones, *classzone;
struct page *page;
+ struct reclaim_state reclaim_state;
+ struct task_struct *p = current;
int i;
int cold;
int do_retry;
- struct reclaim_state reclaim_state;
if (wait)
might_sleep();
@@ -557,8 +559,17 @@
min = 1UL << order;
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];
+ unsigned long local_low;
+
+ /*
+ * This is the fabled 'incremental min'. We let real-time tasks
+ * dip their real-time paws a little deeper into reserves.
+ */
+ local_low = z->pages_low;
+ if (rt_task(p))
+ local_low >>= 1;
+ min += local_low;
- min += z->pages_low;
if (z->free_pages >= min ||
(!wait && z->free_pages >= z->pages_high)) {
page = buffered_rmqueue(z, order, cold);
@@ -594,7 +605,7 @@
/* here we're in the low on memory slow path */
rebalance:
- if ((current->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
+ if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];
@@ -610,14 +621,14 @@
if (!wait)
goto nopage;
- current->flags |= PF_MEMALLOC;
+ p->flags |= PF_MEMALLOC;
reclaim_state.reclaimed_slab = 0;
- current->reclaim_state = &reclaim_state;
+ p->reclaim_state = &reclaim_state;
try_to_free_pages(classzone, gfp_mask, order);
- current->reclaim_state = NULL;
- current->flags &= ~PF_MEMALLOC;
+ p->reclaim_state = NULL;
+ p->flags &= ~PF_MEMALLOC;
/* go through the zonelist yet one more time */
min = 1UL << order;
@@ -657,7 +668,7 @@
if (!(gfp_mask & __GFP_NOWARN)) {
printk("%s: page allocation failure."
" order:%d, mode:0x%x\n",
- current->comm, order, gfp_mask);
+ p->comm, order, gfp_mask);
}
return NULL;
got_pg:
diff -urN linux-2.6.0-test2-mm4/mm/page-writeback.c linux/mm/page-writeback.c
--- linux-2.6.0-test2-mm4/mm/page-writeback.c 2003-08-05 14:53:47.000000000 -0700
+++ linux/mm/page-writeback.c 2003-08-05 13:48:43.000000000 -0700
@@ -227,7 +227,11 @@
if (dirty_exceeded)
ratelimit = 8;
- if (get_cpu_var(ratelimits)++ >= ratelimit) {
+ /*
+ * Check the rate limiting. Also, we do not want to throttle real-time
+ * tasks in balance_dirty_pages(). Period.
+ */
+ if (get_cpu_var(ratelimits)++ >= ratelimit && !rt_task(current)) {
__get_cpu_var(ratelimits) = 0;
put_cpu_var(ratelimits);
return balance_dirty_pages(mapping);
Robert Love <[email protected]> wrote:
>
> > There is a very good argument for giving !SCHED_OTHER tasks
> > "special treatment" in the VM.
>
> Yes, there is.
>
> Attached patch is against 2.6.0-test2-mm4.
Thanks. I guess we should extend the special treatment in the page
allocator, and in balance_dirty_pages() we should still go and update the
various bits of state and poke pdflush if needed.
These changes should yield quite significant improvements in worst-case
latencies for realtime tasks in some situations. I'll test them...
diff -puN mm/page_alloc.c~rt-tasks-special-vm-treatment-2 mm/page_alloc.c
--- 25/mm/page_alloc.c~rt-tasks-special-vm-treatment-2 2003-08-05 16:57:51.000000000 -0700
+++ 25-akpm/mm/page_alloc.c 2003-08-05 16:58:11.000000000 -0700
@@ -592,6 +592,8 @@ __alloc_pages(unsigned int gfp_mask, uns
local_min = z->pages_min;
if (gfp_mask & __GFP_HIGH)
local_min >>= 2;
+ if (rt_task(p))
+ local_min >>= 1;
min += local_min;
if (z->free_pages >= min ||
(!wait && z->free_pages >= z->pages_high)) {
diff -puN mm/page-writeback.c~rt-tasks-special-vm-treatment-2 mm/page-writeback.c
--- 25/mm/page-writeback.c~rt-tasks-special-vm-treatment-2 2003-08-05 16:57:51.000000000 -0700
+++ 25-akpm/mm/page-writeback.c 2003-08-05 17:02:40.000000000 -0700
@@ -144,7 +144,7 @@ get_dirty_limits(struct page_state *ps,
* If we're over `background_thresh' then pdflush is woken to perform some
* writeout.
*/
-void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping)
{
struct page_state ps;
long nr_reclaimable;
@@ -167,8 +167,9 @@ void balance_dirty_pages(struct address_
nr_reclaimable = ps.nr_dirty + ps.nr_unstable;
if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
break;
-
dirty_exceeded = 1;
+ if (rt_task(current))
+ break;
/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
@@ -223,7 +224,7 @@ void balance_dirty_pages_ratelimited(str
* Check the rate limiting. Also, we do not want to throttle real-time
* tasks in balance_dirty_pages(). Period.
*/
- if (get_cpu_var(ratelimits)++ >= ratelimit && !rt_task(current)) {
+ if (get_cpu_var(ratelimits)++ >= ratelimit) {
__get_cpu_var(ratelimits) = 0;
put_cpu_var(ratelimits);
balance_dirty_pages(mapping);
_
On Tue, 2003-08-05 at 17:09, Andrew Morton wrote:
> -void balance_dirty_pages(struct address_space *mapping)
> +static void balance_dirty_pages(struct address_space *mapping)
Hrm. void? I have this as an int in my tree (test2-mm4), did you change
something? The function returns stuff.. I made it a 'static int'
> dirty_exceeded = 1;
> + if (rt_task(current))
> + break;
OK, this was my other option. I think this is better because, as we have
both said, it allows us to wake up pdflush.
Here is what I have right now..
Robert Love
include/linux/sched.h | 4 +++-
kernel/sched.c | 1 -
mm/page-writeback.c | 11 +++++++++--
mm/page_alloc.c | 31 ++++++++++++++++++++++---------
4 files changed, 34 insertions(+), 13 deletions(-)
diff -urN linux-2.6.0-test2-mm4/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.0-test2-mm4/include/linux/sched.h 2003-08-05 14:53:47.000000000 -0700
+++ linux/include/linux/sched.h 2003-08-05 12:38:41.000000000 -0700
@@ -282,7 +282,9 @@
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define MAX_PRIO (MAX_RT_PRIO + 40)
-
+
+#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+
/*
* Some day this will be a full-fledged user tracking system..
*/
diff -urN linux-2.6.0-test2-mm4/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.0-test2-mm4/kernel/sched.c 2003-08-05 14:53:47.000000000 -0700
+++ linux/kernel/sched.c 2003-08-05 12:38:29.000000000 -0700
@@ -199,7 +199,6 @@
#define this_rq() (cpu_rq(smp_processor_id())) /* not __get_cpu_var(runqueues)! */
#define task_rq(p) cpu_rq(task_cpu(p))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
-#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
/*
* Default context-switch locking:
diff -urN linux-2.6.0-test2-mm4/mm/page_alloc.c linux/mm/page_alloc.c
--- linux-2.6.0-test2-mm4/mm/page_alloc.c 2003-08-05 14:48:38.000000000 -0700
+++ linux/mm/page_alloc.c 2003-08-05 17:22:30.000000000 -0700
@@ -518,7 +518,8 @@
*
* Herein lies the mysterious "incremental min". That's the
*
- * min += z->pages_low;
+ * local_low = z->pages_low;
+ * min += local_low;
*
* thing. The intent here is to provide additional protection to low zones for
* allocation requests which _could_ use higher zones. So a GFP_HIGHMEM
@@ -536,10 +537,11 @@
unsigned long min;
struct zone **zones, *classzone;
struct page *page;
+ struct reclaim_state reclaim_state;
+ struct task_struct *p = current;
int i;
int cold;
int do_retry;
- struct reclaim_state reclaim_state;
if (wait)
might_sleep();
@@ -557,8 +559,17 @@
min = 1UL << order;
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];
+ unsigned long local_low;
+
+ /*
+ * This is the fabled 'incremental min'. We let real-time tasks
+ * dip their real-time paws a little deeper into reserves.
+ */
+ local_low = z->pages_low;
+ if (rt_task(p))
+ local_low >>= 1;
+ min += local_low;
- min += z->pages_low;
if (z->free_pages >= min ||
(!wait && z->free_pages >= z->pages_high)) {
page = buffered_rmqueue(z, order, cold);
@@ -581,6 +592,8 @@
local_min = z->pages_min;
if (gfp_mask & __GFP_HIGH)
local_min >>= 2;
+ if (rt_task(p))
+ local_min >>= 1;
min += local_min;
if (z->free_pages >= min ||
(!wait && z->free_pages >= z->pages_high)) {
@@ -594,7 +607,7 @@
/* here we're in the low on memory slow path */
rebalance:
- if ((current->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
+ if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];
@@ -610,14 +623,14 @@
if (!wait)
goto nopage;
- current->flags |= PF_MEMALLOC;
+ p->flags |= PF_MEMALLOC;
reclaim_state.reclaimed_slab = 0;
- current->reclaim_state = &reclaim_state;
+ p->reclaim_state = &reclaim_state;
try_to_free_pages(classzone, gfp_mask, order);
- current->reclaim_state = NULL;
- current->flags &= ~PF_MEMALLOC;
+ p->reclaim_state = NULL;
+ p->flags &= ~PF_MEMALLOC;
/* go through the zonelist yet one more time */
min = 1UL << order;
@@ -657,7 +670,7 @@
if (!(gfp_mask & __GFP_NOWARN)) {
printk("%s: page allocation failure."
" order:%d, mode:0x%x\n",
- current->comm, order, gfp_mask);
+ p->comm, order, gfp_mask);
}
return NULL;
got_pg:
diff -urN linux-2.6.0-test2-mm4/mm/page-writeback.c linux/mm/page-writeback.c
--- linux-2.6.0-test2-mm4/mm/page-writeback.c 2003-08-05 14:53:47.000000000 -0700
+++ linux/mm/page-writeback.c 2003-08-05 17:35:36.095648523 -0700
@@ -145,7 +145,7 @@
* If we're over `background_thresh' then pdflush is woken to perform some
* writeout.
*/
-int balance_dirty_pages(struct address_space *mapping)
+static int balance_dirty_pages(struct address_space *mapping)
{
struct page_state ps;
long nr_reclaimable;
@@ -169,9 +169,16 @@
nr_reclaimable = ps.nr_dirty + ps.nr_unstable;
if (nr_reclaimable + ps.nr_writeback <= dirty_thresh)
break;
-
dirty_exceeded = 1;
+ /*
+ * We do not want to throttle a real-time task here. Ever.
+ * But we do want to update the accounting and possibly poke
+ * pdflush below.
+ */
+ if (rt_task(current))
+ break;
+
/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
* Unstable writes are a feature of certain networked
* filesystems (i.e. NFS) in which data may have been
Robert Love <[email protected]> wrote:
>
> On Tue, 2003-08-05 at 17:09, Andrew Morton wrote:
>
> > -void balance_dirty_pages(struct address_space *mapping)
> > +static void balance_dirty_pages(struct address_space *mapping)
>
> Hrm. void? I have this as an int in my tree (test2-mm4), did you change
> something? The function returns stuff.. I made it a 'static int'
ah, I had inserted that patch before the AIO patches, which change that.
> > dirty_exceeded = 1;
> > + if (rt_task(current))
> > + break;
>
> OK, this was my other option. I think this is better because, as we have
> both said, it allows us to wake up pdflush.
>
> Here is what I have right now..
It's testing time.
On Tue, 2003-08-05 at 17:45, Andrew Morton wrote:
> It's testing time.
Just via some instrumenting, I can see that a real-time task never
begins throttling, and this translates to a ~1ms reduction in worst-case
allocation latency on a fast machine under extreme page dirtying and
writeback (basically, I can no longer reproduce any variation in page
allocation for a real-time test app). So it works.
But I do not have any real world test to confirm a benefit, which is
what matters. Have you poked and prodded?
Robert Love
Robert Love <[email protected]> wrote:
>
> On Tue, 2003-08-05 at 17:45, Andrew Morton wrote:
>
> > It's testing time.
>
> Just via some instrumenting, I can see that a real-time task never
> begins throttling, and this translates to a ~1ms reduction in worst-case
> allocation latency on a fast machine under extreme page dirtying and
> writeback (basically, I can no longer reproduce any variation in page
> allocation for a real-time test app). So it works.
>
> But I do not have any real world test to confirm a benefit, which is
> what matters. Have you poked and prodded?
>
It's pretty easy to demonstrate the benefit of the balance_dirty_pages()
change. Just do:
while true
do
dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
done
and also:
rm 1 ; sleep 3; time dd if=/dev/zero of=1 bs=16M count=1
The 16M dd normally takes 1.5 seconds (I'm pretty pleased with that btw.
Very repeatable and fair). If you run the 16M dd with SCHED_FIFO it takes
a repeatable 0.12 seconds.
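A minimal sketch of how the SCHED_FIFO run might be done, assuming root and
plain POSIX scheduling calls (this wrapper is illustrative only, not part of
the thread or the patch; chrt from schedutils should do the same job):

/*
 * fifo-run: switch to SCHED_FIFO, then exec the given command, e.g.
 *   ./fifo-run dd if=/dev/zero of=1 bs=16M count=1
 * Needs root so that sched_setscheduler() is permitted.
 */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        struct sched_param param = { .sched_priority = 1 };

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }
        if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
                perror("sched_setscheduler");
                return 1;
        }
        execvp(argv[1], &argv[1]);
        perror("execvp");       /* only reached if the exec fails */
        return 1;
}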
But it's pretty easy to make the machine do bad things with no dirty memory
throttling. Probably we should just treat realtime tasks in the same way
as PF_LESS_THROTTLE tasks.
It's a bit harder to demonstrate benefits from the page allocator change.
Allocating 4 megs while there is a huge amount of swapout happening is
much quicker and more repeatable. But generally things work pretty well
with just SCHED_OTHER. Needs some more careful testing.
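An illustrative sketch of that kind of test, not from the thread: allocate
and touch 4 megs in a loop, record the worst case, and compare a normal run
against a SCHED_FIFO run (e.g. via the wrapper above) while the dd loop keeps
the machine busy dirtying and writing back:

/*
 * alloc-test: time how long it takes to allocate and fault in 4MB,
 * and report the worst case seen. Purely a test sketch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define CHUNK (4 * 1024 * 1024)

int main(void)
{
        long usec, worst = 0;
        int i;

        for (i = 0; i < 50; i++) {
                struct timeval start, end;
                char *buf;

                gettimeofday(&start, NULL);
                buf = malloc(CHUNK);
                if (!buf) {
                        perror("malloc");
                        return 1;
                }
                memset(buf, 0, CHUNK);  /* actually fault the pages in */
                gettimeofday(&end, NULL);

                usec = (end.tv_sec - start.tv_sec) * 1000000L +
                       (end.tv_usec - start.tv_usec);
                if (usec > worst)
                        worst = usec;

                free(buf);
                sleep(1);
        }
        printf("worst-case 4MB allocate+touch: %ld usec\n", worst);
        return 0;
}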
Often you get great big stalls because your SCHED_RR task is stuck in
ext3's journal_start, waiting for kjournald to finish a commit, due to an
atime update. So noatime is needed there. And then it gets stuck in a
disk read.
So running a program off disk isn't a very good test.
On Wed, 2003-08-06 at 01:41, Andrew Morton wrote:
> It's pretty easy to demonstrate the benefit of the balance_dirty_pages()
> change. Just do:
>
> while true
> do
> dd if=/dev/zero of=foo bs=1M count=512 conv=notrunc
> done
>
> and also:
>
> rm 1 ; sleep 3; time dd if=/dev/zero of=1 bs=16M count=1
>
> The 16M dd normally takes 1.5 seconds (I'm pretty pleased with that btw.
> Very repeatable and fair). If you run the 16M dd with SCHED_FIFO it takes
> a repeatable 0.12 seconds.
This is what I did. Same results, basically.
What I did not do was prove that the xmms stalls went away for those who
were seeing them.
> So running a program off disk isn't a very good test.
No, it's not. And in general, real-time tasks should not do disk I/O (at
least not via their core RT thread). And they should mlock() their
memory.
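The usual setup for that, as an illustrative sketch rather than anything from
this thread: lock current and future memory with mlockall() and switch to
SCHED_FIFO during initialization, before the latency-sensitive work starts
(needs root):

/*
 * rt_setup: pin the process's memory and make it real-time. Call this
 * once at startup from the RT application's main thread.
 */
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

static int rt_setup(int priority)
{
        struct sched_param param = { .sched_priority = priority };

        /* Lock everything mapped now and in the future, so the RT
         * thread never takes a major fault on its own memory. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
                perror("mlockall");
                return -1;
        }
        if (sched_setscheduler(0, SCHED_FIFO, &param) < 0) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}

int main(void)
{
        if (rt_setup(10) < 0)
                return 1;
        /* ... latency-sensitive work goes here ... */
        return 0;
}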
But circumstances do differ, and these changes are in the right
direction, I think. It also means that someone can, e.g., make xmms or
whatever real-time and hopefully avoid the memory-related stalls that
spawned the discussion and this patch.
Robert Love