2008-02-14 16:21:19

by Lukas Hejtmanek

Subject: Disk schedulers

Hello,

whom should I blame about disk schedulers?

I have the following setup:
1Gb network
2GB RAM
disk write speed about 20MB/s

If I'm scping file (about 500MB) from the network (which is faster than the
local disk), any process is totally unable to read anything from the local disk
till the scp finishes. It is not caused by low free memory, while scping
I have 500MB of free memory (not cached or buffered).

I tried cfq and anticipatory scheduler, none is different.

--
Lukáš Hejtmánek


2008-02-15 00:02:47

by Tejun Heo

Subject: Re: Disk schedulers

Lukas Hejtmanek wrote:
> Hello,
>
> whom should I blame about disk schedulers?
>
> I have the following setup:
> 1Gb network
> 2GB RAM
> disk write speed about 20MB/s
>
> If I'm scping file (about 500MB) from the network (which is faster than the
> local disk), any process is totally unable to read anything from the local disk
> till the scp finishes. It is not caused by low free memory, while scping
> I have 500MB of free memory (not cached or buffered).
>
> I tried cfq and anticipatory scheduler, none is different.
>

Does deadline help?
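
You can switch it at runtime if that's easier than rebooting with
elevator=deadline - a quick sketch, assuming the disk is sda:

  cat /sys/block/sda/queue/scheduler    # e.g. noop anticipatory deadline [cfq]
  echo deadline > /sys/block/sda/queue/scheduler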

--
tejun

2008-02-15 10:09:34

by Lukas Hejtmanek

Subject: Re: Disk schedulers

On Fri, Feb 15, 2008 at 09:02:31AM +0900, Tejun Heo wrote:
> > till the scp finishes. It is not caused by low free memory, while scping
> > I have 500MB of free memory (not cached or buffered).
> >
> > I tried cfq and anticipatory scheduler, none is different.
> >
>
> Does deadline help?

Well, deadline is a little bit better. I'm trying to read from disk by
opening a maildir with 20000 mails in mutt. If I open that maildir, mutt
shows progress. With the cfq or anticipatory scheduler, the progress is
0/20000 until the scp finishes. With deadline, the progress is 150/20000
by the time the scp has finished. So I would say it is better, but I
doubt it is OK.

--
Lukáš Hejtmánek

2008-02-15 14:43:12

by Jan Engelhardt

Subject: Re: Disk schedulers


On Feb 14 2008 17:21, Lukas Hejtmanek wrote:
>Hello,
>
>whom should I blame about disk schedulers?

Also consider
- DMA (e.g. only UDMA2 selected)
- aging disk

2008-02-15 14:58:18

by Prakash Punnoor

Subject: Re: Disk schedulers

On the day of Friday 15 February 2008 Jan Engelhardt hast written:
> On Feb 14 2008 17:21, Lukas Hejtmanek wrote:
> >Hello,
> >
> >whom should I blame about disk schedulers?
>
> Also consider
> - DMA (e.g. only UDMA2 selected)
> - aging disk

Nope, I also reported this problem _years_ ago, but till now much hasn't
changed. Large writes lead to read starvation.

--
Prakash Punnoor



2008-02-15 15:59:36

by Lukas Hejtmanek

Subject: Re: Disk schedulers

On Fri, Feb 15, 2008 at 03:42:58PM +0100, Jan Engelhardt wrote:
> Also consider
> - DMA (e.g. only UDMA2 selected)
> - aging disk

it's not the case.

hdparm reports udma5 is used, if it is reliable with libata.

The disk is 3 months old, kernel does not report any errors. And it has never
been different.

--
Lukáš Hejtmánek

2008-02-15 16:50:52

by Jeffrey Hundstad

Subject: Re: Disk schedulers

Lukas Hejtmanek,

I have to say that I've heard this subject before; the summary answer
seems to be that the kernel cannot guess the wishes of the user 100%
of the time. If you have a low-priority I/O task, use ionice(1) to set
the priority of that task so it doesn't nuke your high-priority task.

I have no personal stake in this answer, but I can report that for my
high-I/O tasks it works like a charm.
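
For example (a minimal sketch - the host, file names and the PID are
only placeholders):

  # start a bulk writer in the idle I/O class
  ionice -c3 scp user@remotehost:bigfile /local/dir/

  # or demote an already running process to best-effort, lowest priority
  ionice -c2 -n7 -p 12345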

--
Jeffrey Hundstad

Lukas Hejtmanek wrote:
> On Fri, Feb 15, 2008 at 03:42:58PM +0100, Jan Engelhardt wrote:
>
>> Also consider
>> - DMA (e.g. only UDMA2 selected)
>> - aging disk
>>
>
> it's not the case.
>
> hdparm reports udma5 is used, if it is reliable with libata.
>
> The disk is 3 months old, kernel does not report any errors. And it has never
> been different.
>
> --
> Lukáš Hejtmánek
>

2008-02-15 17:11:22

by Zan Lynx

Subject: Re: Disk schedulers


On Fri, 2008-02-15 at 15:57 +0100, Prakash Punnoor wrote:
> On the day of Friday 15 February 2008 Jan Engelhardt hast written:
> > On Feb 14 2008 17:21, Lukas Hejtmanek wrote:
> > >Hello,
> > >
> > >whom should I blame about disk schedulers?
> >
> > Also consider
> > - DMA (e.g. only UDMA2 selected)
> > - aging disk
>
> Nope, I also reported this problem _years_ ago, but till now much hasn't
> changed. Large writes lead to read starvation.

Yes, I see this often myself. It's like the disk IO queue (I set mine
to 1024) fills up, and pdflush and friends can stuff write requests into
it much more quickly than any other programs can provide read requests.

CFQ and ionice work very well up until iostat shows average IO queuing
above 1024 (where I set the queue number).
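
The queue depth I mean is the per-device one in sysfs - a minimal
sketch, assuming the disk is sda:

  cat /sys/block/sda/queue/nr_requests    # 128 by default on most kernels
  echo 1024 > /sys/block/sda/queue/nr_requests
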
--
Zan Lynx <[email protected]>



2008-02-15 17:25:01

by Paulo Marques

Subject: Re: Disk schedulers

Lukas Hejtmanek wrote:
> [...]
> If I'm scping file (about 500MB) from the network (which is faster than the
> local disk), any process is totally unable to read anything from the local disk
> till the scp finishes. It is not caused by low free memory, while scping
> I have 500MB of free memory (not cached or buffered).

If you want to take advantage of all that memory to buffer disk writes,
so that the reads can proceed better, you might want to tweak your
/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio to more
appropriate values (maybe also dirty_writeback_centisecs and
dirty_expire_centisecs).

You can read all about those tunables in Documentation/filesystems/proc.txt.
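
For example (the numbers here are only illustrative, not a
recommendation):

  # start background writeback earlier and cap dirty memory lower
  echo 5  > /proc/sys/vm/dirty_background_ratio
  echo 10 > /proc/sys/vm/dirty_ratio

  # or, equivalently, via sysctl
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10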

Just my 2 cents,

--
Paulo Marques - http://www.grupopie.com

"Very funny Scotty. Now beam up my clothes."

2008-02-15 17:37:01

by Roger Heflin

Subject: Re: Disk schedulers

Lukas Hejtmanek wrote:
> On Fri, Feb 15, 2008 at 03:42:58PM +0100, Jan Engelhardt wrote:
>> Also consider
>> - DMA (e.g. only UDMA2 selected)
>> - aging disk
>
> it's not the case.
>
> hdparm reports udma5 is used, if it is reliable with libata.
>
> The disk is 3 months old, kernel does not report any errors. And it has never
> been different.
>

A new current IDE/SATA disk should do around 60MB/s; check the min/max
bit rate listed on the disk company's site, divide by 8, and take maybe
80% of that.

You may also consider using the -l option on the scp command to limit
its bandwidth usage.
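
For instance (host and paths are placeholders; 80000 Kbit/s is roughly
10MB/s):

  scp -l 80000 user@remotehost:/path/to/bigfile /local/dir/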

This behaviour has been around for at least 8 years (since 2.2): high
levels of writes significantly starve out reads, mainly because you can
queue up thousands of writes plus a single read; when that read
finishes, there are another few thousand writes for the next read to
get in line behind and wait, and this continues until the writes stop.

Roger

2008-02-15 21:32:32

by FD Cami

Subject: Re: Disk schedulers

On Fri, 15 Feb 2008 10:11:26 -0700
Zan Lynx <[email protected]> wrote:

>
> On Fri, 2008-02-15 at 15:57 +0100, Prakash Punnoor wrote:
> > On the day of Friday 15 February 2008 Jan Engelhardt hast written:
> > > On Feb 14 2008 17:21, Lukas Hejtmanek wrote:
> > > >Hello,
> > > >
> > > >whom should I blame about disk schedulers?
> > >
> > > Also consider
> > > - DMA (e.g. only UDMA2 selected)
> > > - aging disk
> >
> > Nope, I also reported this problem _years_ ago, but till now much hasn't
> > changed. Large writes lead to read starvation.
>
> Yes, I see this often myself. It's like the disk IO queue (I set mine
> to 1024) fills up, and pdflush and friends can stuff write requests into
> it much more quickly than any other programs can provide read requests.
>
> CFQ and ionice work very well up until iostat shows average IO queuing
> above 1024 (where I set the queue number).

I can confirm that as well.

This is easily reproducible with dd if=/dev/zero of=somefile bs=2048,
for example. After a short while, trying to read the disks takes an
awfully long time, even if the dd process is ionice'd.
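
(i.e. something along these lines, where the output file is a throwaway
and the class/level are just examples:)

  ionice -c2 -n7 dd if=/dev/zero of=somefile bs=2048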

What is worse, other drives attached to the same controller become
unresponsive as well.
I use a Dell Perc 5/i (megaraid_sas) with:
* 2 SAS 15000 RPM drives, RAID1 => sda
* 4 SAS 15000 RPM drives, RAID5 => sdb
* 2 SATA 7200 RPM drives, RAID1 => sdc
Using dd or mkfs on sdb or sdc makes sda unresponsive as well.
Is this expected?

Cheers

Francois

2008-02-16 16:13:35

by Lukas Hejtmanek

Subject: Re: Disk schedulers

On Fri, Feb 15, 2008 at 10:11:26AM -0700, Zan Lynx wrote:
> Yes, I see this often myself. It's like the disk IO queue (I set mine
> to 1024) fills up, and pdflush and friends can stuff write requests into
> it much more quickly than any other programs can provide read requests.
>
> CFQ and ionice work very well up until iostat shows average IO queuing
> above 1024 (where I set the queue number).

I thought that CFQ would maintain I/O queues per process and pick up
requests in round-robin order from the non-empty queues. Am I wrong? And
if I am, isn't that the desired behavior for a desktop?

--
Lukáš Hejtmánek

2008-02-16 16:15:37

by Lukas Hejtmanek

Subject: Re: Disk schedulers

On Fri, Feb 15, 2008 at 05:24:52PM +0000, Paulo Marques wrote:
> If you want to take advantage of all that memory to buffer disk writes,
> so that the reads can proceed better, you might want to tweak your
> /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio to more
> appropriate values (maybe also dirty_writeback_centisecs and
> dirty_expire_centisecs).

I don't want my whole memory to be eaten by a single file which is not
going to be read again and is thus pretty useless to cache. Instead, I
would like to see scp slowed down so that other processes can also
access the disk. Why is this possible with the kernel process scheduler
and not with the I/O scheduler?

--
Lukáš Hejtmánek

2008-02-17 19:38:19

by L A Walsh

Subject: Re: Disk schedulers

Lukas Hejtmanek wrote:
> whom should I blame about disk schedulers?
>
> I have the following setup:
> 1Gb network
> 2GB RAM
> disk write speed about 20MB/s
>
> If I'm scping file (about 500MB) from the network (which is faster than the
> local disk), any process is totally unable to read anything from the local disk
> till the scp finishes. It is not caused by low free memory, while scping
> I have 500MB of free memory (not cached or buffered).
>
> I tried cfq and anticipatory scheduler, none is different.
> ----
>
You didn't say anything about #processors or speed, nor
did you say anything about your hard disk's raw-io ability.
You also didn't mention what kernel version or whether or not
you were using the new UID-group cpu scheduler in 2.6.24 (which likes
to default to 'on'; not a great choice for single-user, desktop-type
machines, if I understand its grouping policy).

Are you sure neither end of the copy is cpu-bound on ssh/scp
encrypt/decrypt calculations? It might not just be inability
to read from disk, but low CPU availability. Scp can be a lot
more CPU intensive than you would expect... Just something to
consider...

Linda

2008-02-20 11:34:05

by Pavel Machek

Subject: Re: Disk schedulers

Hi!

> whom should I blame about disk schedulers?
>
> I have the following setup:
> 1Gb network
> 2GB RAM
> disk write speed about 20MB/s
>
> If I'm scping file (about 500MB) from the network (which is faster than the
> local disk), any process is totally unable to read anything from the local disk
> till the scp finishes. It is not caused by low free memory, while scping
> I have 500MB of free memory (not cached or buffered).
>
> I tried cfq and anticipatory scheduler, none is different.

Is cat /dev/zero > file enough to reproduce this?

ext3 filesystem?

Will cat /etc/passwd work while machine is unresponsive?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-20 17:05:02

by Zdenek Kabelac

Subject: Re: Disk schedulers

2008/2/15, Zan Lynx <[email protected]>:
>
> On Fri, 2008-02-15 at 15:57 +0100, Prakash Punnoor wrote:
> > On the day of Friday 15 February 2008 Jan Engelhardt hast written:
> > > On Feb 14 2008 17:21, Lukas Hejtmanek wrote:
> > > >Hello,
> > > >
> > > >whom should I blame about disk schedulers?
> > >
> > > Also consider
> > > - DMA (e.g. only UDMA2 selected)
> > > - aging disk
> >
> > Nope, I also reported this problem _years_ ago, but till now much hasn't
> > changed. Large writes lead to read starvation.
>
> Yes, I see this often myself. It's like the disk IO queue (I set mine
> to 1024) fills up, and pdflush and friends can stuff write requests into
> it much more quickly than any other programs can provide read requests.
>
> CFQ and ionice work very well up until iostat shows average IO queuing
> above 1024 (where I set the queue number).

I should probably summarize my experience here as well:

I'm using Qemu - inside it I'm testing a kernel module which does a lot
of disk copy operations - its virtual disk has 8GB. When my test is
started, my system starts to feel unresponsive a couple of times per
minute for nearly 10 minutes - especially in a chat tool like pidgin,
where I'm often left for 5 seconds without any visible refresh on
screen (redraws, typed keys, ...). Firefox shows similar symptoms...

Obviously pidgin carries some of the responsibility here - from strace
it's visible that it tries to open and read files it has already read a
zillion times before :) - but that's another story.

I've tried many things - I've started qemu with ionice -c0, used
ionice -c2 for pidgin, tried different I/O schedulers, niced qemu, and
changed swappiness to different values according to various tips &
tricks I could find around the web - and I still cannot get a properly
running system with my qemu test case, because the system feels
unresponsive in any application which needs to touch the drive.

Does anyone have any ideas what I should try/test/check?

BTW, one interesting thing I've noticed is a very high kernel IPI count,
i.e.:
77,0% (3794,9) <kernel IPI> : Rescheduling interrupts

Sometimes this number approaches the 10000 barrier.

My machine is a 2.2GHz C2D T61 with 2GB - the CPU is 50% idle while the
machine freezes - and yes, I can move the mouse all the time ;) and no,
I'm not out of RAM.
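
In case anyone wants to compare numbers, the raw per-CPU counters can be
watched like this (a sketch, assuming an SMP x86 box):

  watch -d 'grep -i resched /proc/interrupts'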

Zdenek

2008-02-20 18:49:38

by Lukas Hejtmanek

Subject: Re: Disk schedulers

On Sat, Feb 16, 2008 at 05:20:49PM +0000, Pavel Machek wrote:
> Is cat /dev/zero > file enough to reproduce this?

yes.


> ext3 filesystem?

yes.

> Will cat /etc/passwd work while machine is unresponsive?

yes.

while find does not work:
time find /
/
/etc
/etc/manpath.config
/etc/update-manager
/etc/update-manager/release-upgrades
/etc/gshadow-
/etc/inputrc
/etc/openalrc
/etc/bonobo-activation
/etc/bonobo-activation/bonobo-activation-config.xml
/etc/gnome-vfs-2.0
/etc/gnome-vfs-2.0/modules
/etc/gnome-vfs-2.0/modules/obex-module.conf
/etc/gnome-vfs-2.0/modules/extra-modules.conf
/etc/gnome-vfs-2.0/modules/theme-method.conf
/etc/gnome-vfs-2.0/modules/font-method.conf
/etc/gnome-vfs-2.0/modules/default-modules.conf
^C

real 0m7.982s
user 0m0.003s
sys 0m0.000s


i.e., it took 8 seconds to list just 17 dir entries.

It looks like I have this problem:
http://www.linuxinsight.com/first_benchmarks_of_the_ext4_file_system.html#comment-619
(the last comment with title: Sustained writes 2 or more times the amount of
memfree....)

--
Lukáš Hejtmánek

2008-02-21 23:06:59

by Giuliano Pochini

Subject: Re: Disk schedulers

On Wed, 20 Feb 2008 19:48:42 +0100
Lukas Hejtmanek <[email protected]> wrote:

> On Sat, Feb 16, 2008 at 05:20:49PM +0000, Pavel Machek wrote:
> > Is cat /dev/zero > file enough to reproduce this?
>
> yes.
>
>
> > ext3 filesystem?
>
> yes.
>
> > Will cat /etc/passwd work while machine is unresponsive?
>
> yes.
>
> while find does not work:
> time find /
> /
> /etc
> /etc/manpath.config
> /etc/update-manager
> /etc/update-manager/release-upgrades
> /etc/gshadow-
> /etc/inputrc
> /etc/openalrc
> /etc/bonobo-activation
> /etc/bonobo-activation/bonobo-activation-config.xml
> /etc/gnome-vfs-2.0
> /etc/gnome-vfs-2.0/modules
> /etc/gnome-vfs-2.0/modules/obex-module.conf
> /etc/gnome-vfs-2.0/modules/extra-modules.conf
> /etc/gnome-vfs-2.0/modules/theme-method.conf
> /etc/gnome-vfs-2.0/modules/font-method.conf
> /etc/gnome-vfs-2.0/modules/default-modules.conf
> ^C
>
> real 0m7.982s
> user 0m0.003s
> sys 0m0.000s
>
>
> i.e., it took 8 seconds to list just 17 dir entries.

It also happens when I'm writing to a slow external disk.
Documentation/block/biodoc.txt says:

"Per-queue granularity unplugging (still a Todo) may help reduce some of
the concerns with just a single tq_disk flush approach. Something like
blk_kick_queue() to unplug a specific queue (right away ?) or optionally,
all queues, is in the plan."

If I understand correctly, there is only one "plug" common to all
devices. That may explain why, when a queue is full, access to other
devices is also blocked.



--
Giuliano.

2008-02-28 17:12:13

by Bill Davidsen

Subject: Re: Disk schedulers

Linda Walsh wrote:

> You didn't say anything about #processors or speed, nor
> did you say anything about your hard disk's raw-io ability.
> You also didn't mention what kernel version or whether or not
> you were using the new UID-group cpu scheduler in 2.6.24 (which likes
> to default to 'on'; not a great choice for single-user, desktop-type
> machines, if I understand its grouping policy).
> Are you sure neither end of the copy is cpu-bound on ssh/scp
> encrypt/decrypt calculations? It might not just be inability
> to read from disk, but low CPU availability. Scp can be a lot
> more CPU intensive than you would expect... Just something to
> consider...
>
Good point, and may I note that you *can* choose your encryption
cipher; I use 'blowfish' for normal operation, since I have some fairly
slow machines with a Gbit net between them.
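
For example (host and path are placeholders; with SSH protocol 2 the
cipher name is blowfish-cbc):

  scp -c blowfish-cbc user@remotehost:/path/to/bigfile /local/dir/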

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot