We recently upgraded 10 servers from 2.2.19 to 2.4.14/2.4.16. Since then,
several servers have experienced severe lockups forcing hardware resets. The
machines are Intel PIII (Dual) SMP running Epox motherboards. Here are the
details:
## The Story:
- Suddenly a machine gets a load average of about 500-1000.
- It's not possible to log in either at the console or by SSH.
- Some commands are possible to run through ssh from a remote server, like:
"ssh badserver ps auxwf" or "ssh badserver free"
- Despite a system load of 1000, commands like "free", "ps" and "uptime"
often respond quickly, no "sluggishness".
- The locked up machine seems to use all available memory plus a good deal
of swap
- The process table gets bigger and bigger, mainly ipop3d processes from
users trying to fetch mail but getting no reply.
- The processors seem to be mostly idle.
- Killing processes doesn't work, not even with SIGKILL.
- We haven't been able to find a time pattern for the lockups, or to
reproduce them at will.
- No kernel error messages are written to the console or logs.
- Ctrl-alt-delete produces a "Rebooting"-message on the console, but there
is no actual reboot. Power cycling is the only way out.
- My not-so-professional guess is that the machine is locked up waiting for
some disk i/o that never happens, either to swap or normal filesystem. But,
I might be all wrong.
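A load average near 1000 with idle CPUs and unkillable processes is what Linux shows when many tasks sit in uninterruptible sleep (state D), because those tasks count toward the load average. The STAT column of the "ps auxwf" output shows the same information. As a minimal sketch, assuming a Linux /proc layout, the per-state task counts could be gathered like this (the helper names are illustrative and not from this thread):

```python
import glob
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Count tasks per state; tasks that exit mid-scan are skipped."""
    counts = Counter()
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, ValueError, IndexError):
            continue
    return counts


if __name__ == "__main__":
    states = count_states()
    print("uninterruptible (D):", states.get("D", 0))
    print("runnable (R):", states.get("R", 0))
```

A large and growing D count next to a small R count would support the guess that the machine is waiting on I/O rather than short of CPU.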
## Hardware:
- Dual PIII 850 on Epox BXB-S and Epox KP6-BS
- 1GB RAM (4x256MB)
- Mylex AcceleRAID 352 PCI RAID Controller,
IBM disks, 3x36GB RAID-5 mounted on /
and 2x18GB RAID-1 mounted on /var/spool
- 1x20GB IDE for /boot and swap (2 x 2GB swap partitions)
- 1x36GB IDE for backups
## Kernel:
- 2.4.14 and 2.4.16
- Patched for reiserfs-quota with patches found at
ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/
( * 50_quota-patch
* dquota_deadlock
* nesting
* reiserfs-quota )
- Complete kernel-config found here:
http://www.ekenberg.se/2.4-trouble/2.4.16-config
- Boot parameters are: "ether=0,0,eth1 panic=60 noapic"
## Filesystems:
- ReiserFS (3.6) except /boot which is ext2
## General
- The servers are used mainly for:
* Apache/PHP with ~1000 VHosts
* Mail (Sendmail, imap, pop3)
* MySQL
## /etc/fstab:
/dev/rd/c0d0 / reiserfs defaults,usrquota,noatime,notail 1 1
/dev/rd/c0d1 /var/spool reiserfs defaults,usrquota,noatime,notail 1 1
/dev/hdb1 /hdb1 reiserfs defaults,noatime,notail 0 0
/dev/hda1 /boot ext2 defaults 1 1
/dev/hda2 swap swap defaults 0 0
/dev/hda3 swap swap defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /proc proc defaults 0 0
## lspci:
00:00.0 Host bridge: Intel Corporation 440BX/ZX - 82443BX/ZX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX - 82443BX/ZX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB PIIX4 ACPI (rev 02)
00:08.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
00:09.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
00:0a.0 PCI bridge: Intel Corporation: Unknown device 0964 (rev 02)
00:0a.1 RAID bus controller: Mylex Corporation: Unknown device 0050 (rev 02)
00:0c.0 SCSI storage controller: Adaptec AHA-2940U2/W / 7890
01:00.0 VGA compatible controller: S3 Inc. 86c368 [Trio 3D/2X] (rev 02)
This is my first post to LKML, please forgive me if I forgot some relevant
info.
Please Cc: replies as I'm not subscribed to LKML.
Best regards,
/Johan Ekenberg
> - My not-so-professional guess is that the machine is locked up waiting for
> some disk i/o that never happens, either to swap or normal filesystem. But,
> I might be all wrong.
I agree 100% with your diagnosis. It's exactly as if your /var/spool volume
hung and the Mylex stopped responding on that channel. I take it there is
nothing in dmesg?
Another thing to try is
touch /foo &
hit return
(should report it finished)
touch /var/spool/foo &
(if this never returns you know your /var/spool choked for some reason)
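The same test can be scripted so that the probe itself never gets stuck. The following is a minimal sketch, not a tested tool: probe_write and its timeout values are hypothetical names chosen for illustration. It creates the file from a child process and only polls that child, since a task blocked in uninterruptible I/O cannot be killed or waited on safely.

```python
import subprocess
import sys
import time


def probe_write(directory, timeout=10.0, poll_interval=0.2):
    """Try to create a file named 'foo' in `directory` from a child process.

    Returns "ok", "failed" or "hung". The parent only polls and never
    blocks on the child: a task stuck in uninterruptible I/O cannot be
    killed, so waiting for it would hang the probe as well.
    """
    code = "import os, sys; open(os.path.join(sys.argv[1], 'foo'), 'a').close()"
    child = subprocess.Popen([sys.executable, "-c", code, directory])
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            return "ok" if status == 0 else "failed"
        time.sleep(poll_interval)
    return "hung"


if __name__ == "__main__":
    # Directories to probe are given on the command line, e.g. / /var/spool
    for mount_point in sys.argv[1:]:
        print(mount_point, probe_write(mount_point))
```

Probing / and /var/spool separately should show which volume, if either, has stopped completing writes.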
> > - My not-so-professional guess is that the machine is locked
> > up waiting for some disk i/o that never happens, either to swap
> > or normal filesystem. But, I might be all wrong.
>
> I agree 100% with your diagnostic. Its directly as if your
> /var/spool volume hung and the mylex stopped responding
> on that channel.
/ and /var/spool are on the same channel on the Mylex. If it's a question of
an entire channel hung up, / must be blocked too. Please note, this happened
on several machines so I guess it's not a hardware fault in the Mylex card.
> I take it there is nothing in dmesg ?
I'll check that next time if the command is runnable.
> touch /foo &
> hit return
> (should report it finished)
> touch /var/spool/foo &
> (if this never returns you know you /var/spool choked for some reason)
Thanks, I'll try that also next time.
In the meantime, are there any patches I should apply or other things to
try? I'd rather see there is no "next time"... Since we also upgraded to
ReiserFS 3.6 it seems difficult/impossible to quickly revert to a
2.2-kernel.(?) These are production machines and people are getting upset
about these lockups.
All help appreciated,
regards Johan
We use the same motherboards. For some reason, if you put exactly 1 gig in
them and then set the kernel option under "Processor type/High Memory
Support" to 4 gig, the machine locks up every once in a while. We ended
up putting in 1.5 gig of RAM and that seemed to fix it. If you didn't
compile in the 4 gig support, Linux wouldn't recognize the full 1 gig of
memory for some reason. This is on a Redhat 7.1 machine with 2.4.x kernels.
---
Brad Dameron Network Account Executive
TSCNet Inc. http://www.tscnet.com
Silverdale, WA. 1-888-8TSCNET
> > /var/spool volume hung and the mylex stopped responding
> > on that channel.
>
> / and /var/spool are on the same channel on the Mylex. If it's a question of
> an entire channel hung up, / must be blocked too. Please note, this happened
> on several machines so I guess it's not a hardware fault in the Mylex card.
I was using "channel" in a logical sense, not a physical one. I agree
it's not likely to be hardware.
> In the meantime, are there any patches I should apply or other things to
> try? I'd rather see there is no "next time"... Since we also upgraded to
> ReiserFS 3.6 it seems difficult/impossible to quickly revert to a
> 2.2-kernel.(?) These are production machines and people are getting upset
> about these lockups.
If anything, unapplying patches might be the first move, to eliminate hangs
caused by, say, races in the quota patches you applied. I don't know of any,
but that's always got to be a question since I've not seen identical reports
before this one.
Alan
> We use the same motherboards. And for some reason if you put 1 gig in them
> exactly and then in the kernel under "Processor type/High Memory Support" we
> set it to use 4 gig it locks up the machine every once in a while.
How would it lock up? Could you describe the symptoms in more detail so that
I can compare?
> We ended up putting in 1.5 gig of ram and that seemed to fix it.
Oh, how did you do that? We tried to put 2Gb on one board but it wouldn't
recognize more than 1 Gb. I'm fairly sure we flashed the bios with the
latest firmware. Which board is it you have, Epox BXB-S or Epox KP6-BS?
> If you didn't
> compile in the 4 gig support Linux wouldn't recognize the full 1 gig of
> memory for some reason. This is on a Redhat 7.1 machine with 2.4.x kernel's.
Yes, I noticed that too. Unfortunately, I've tested with both 1G and 4G
support when configuring the kernel, and the i/o lockups still happen.
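The "full 1 gig not recognized" observation is consistent with the usual i386 memory layout. As a rough worked example, assuming the default 3 GB/1 GB user/kernel split and a 128 MB vmalloc reserve (both values are assumptions about the kernels in question, not something stated in this thread), a kernel built without high memory support can directly map only about 896 MB:

```python
KERNEL_ADDRESS_SPACE_MB = 1024  # assumption: default 3G/1G split on i386
VMALLOC_RESERVE_MB = 128        # assumption: default vmalloc reserve


def directly_mapped_mb(installed_mb):
    """RAM usable without high memory support, under the assumptions above."""
    lowmem_limit = KERNEL_ADDRESS_SPACE_MB - VMALLOC_RESERVE_MB
    return min(installed_mb, lowmem_limit)


print(directly_mapped_mb(1024))
```

Under these assumptions the remaining 128 MB of a 1 GB machine is reachable only with high memory support enabled.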
On Wednesday, December 12, 2001 12:29:38 AM +0100 Johan Ekenberg
<[email protected]> wrote:
> We recently upgraded 10 servers from 2.2.19 to 2.4.14/2.4.16. Since then,
> several servers have experienced severe lockups forcing hardware resets. The
> machines are Intel PIII (Dual) SMP running Epox motherboards. Here are the
> details:
>
>## Kernel:
> - 2.4.14 and 2.4.16
> - Patched for reiserfs-quota with patches found at
> ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/
> ( * 50_quota-patch
> * dquota_deadlock
> * nesting
> * reiserfs-quota )
For the 2.4.16 kernel, you used the quota patches from my 2.4.16 dir?
> - Complete kernel-config found here:
> http://www.ekenberg.se/2.4-trouble/2.4.16-config
> - Boot parameters are: "ether=0,0,eth1 panic=60 noapic"
>
>## Filesystems:
> - ReiserFS (3.6) except /boot which is ext2
>
>## General
> - The servers are used mainly for:
> * Apache/PHP with ~1000 VHosts
> * Mail (Sendmail, imap, pop3)
> * MySQL
Anyone know offhand if mysql uses mmap for writing to the database files?
The docs mention it for readonly compressed tables.
The fastest way to rule out filesystem deadlocks is to hook up a serial
console and send me the decoded output of sysrq-t.
-chris
> Another thing to try is
>
> touch /foo &
> hit return
> (should report it finished)
> touch /var/spool/foo &
> (if this never returns you know you /var/spool choked for some reason)
BTW, these commands don't work over SSH, ie the '&' doesn't produce a
background job + report-when-finished when running like:
ssh badserver "touch /foo &"
If I run without '&', would that just touch a file somewhere in the
cache-memory, ie not flushed to disk, or would it still detect if a disk is
hung? What's the point of running it in the bg anyway?
Is there any chance the lockup could be with one of the IDE disks running
swap or backups? Could that produce a global lockup of this kind?
## /etc/fstab:
/dev/rd/c0d0 / reiserfs defaults,usrquota,noatime,notail 1 1
/dev/rd/c0d1 /var/spool reiserfs defaults,usrquota,noatime,notail 1 1
/dev/hdb1 /backup reiserfs defaults,noatime,notail 0 0
/dev/hda1 /boot ext2 defaults 1 1
/dev/hda2 swap swap defaults 0 0
/dev/hda3 swap swap defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /proc proc defaults 0 0
Best regards,
/Johan Ekenberg
> >## Kernel:
> > - 2.4.14 and 2.4.16
> > - Patched for reiserfs-quota with patches found at
> > ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/
> > ( * 50_quota-patch
> > * dquota_deadlock
> > * nesting
> > * reiserfs-quota )
>
> For the 2.4.16 kernel, you used the quota patches from my 2.4.16 dir?
Yes.
> The fastest way to rule out filesystem deadlocks is to hook up a serial
> console and send me the decoded output of sysrq-t.
I'll look into this. A bit of a problem since there are 10 servers and you
never know which one is going to lockup next time. Do I really need 10 PC's
to monitor them simultaneously or could it be done more efficiently? I'm no
kernel hacker, any pointers as to what tools to use etc would be most
welcome.
Thanks,
/Johan Ekenberg
Johan Ekenberg wrote:
>>>## Kernel:
>>> - 2.4.14 and 2.4.16
>>> - Patched for reiserfs-quota with patches found at
>>> ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/
>>> ( * 50_quota-patch
>>> * dquota_deadlock
>>> * nesting
>>> * reiserfs-quota )
>>>
>>For the 2.4.16 kernel, you used the quota patches from my 2.4.16 dir?
>>
>
>Yes.
>
>>The fastest way to rule out filesystem deadlocks is to hook up a serial
>>console and send me the decoded output of sysrq-t.
>>
>
>I'll look into this. A bit of a problem since there are 10 servers and you
>never know which one is going to lockup next time. Do I really need 10 PC's
>to monitor them simultaneously or could it be done more efficiently? I'm no
>kernel hacker, any pointers as to what tools to use etc would be most
>welcome.
>
>Thanks,
>/Johan Ekenberg
>
You need ten serial ports. Actually, a lot of machine rooms have all
their servers connected via serial lines to one machine (or at least
this was true years ago when I was a sysadmin). It makes it a lot
easier to administer them by remote access from home in the middle of
the night.
Hans
> BTW, these commands don't work over SSH, ie the '&' doesn't produce a
> background job + report-when-finished when running like:
> ssh badserver "touch /foo &"
> If I run without '&', would that just touch a file somewhere in the
> cache-memory, ie not flushed to disk, or would it still detect if a disk is
> hung? What's the point of running it in the bg anyway?
I was thinking "from a shell". Doing it with two ssh commands should show
which one, if either, hangs.
> Is there any chance the lockup could be with one of the IDE disks running
> swap or backups? Could that produce a global lockup of this kind?
I think it's a reiserfs/quota/DAC960 deadlock, though which of those is open
to question at the moment.
On Wednesday, December 12, 2001 02:01:25 AM +0100 Johan Ekenberg
<[email protected]> wrote:
>> >## Kernel:
>> > - 2.4.14 and 2.4.16
>> > - Patched for reiserfs-quota with patches found at
>> > ftp://ftp.suse.com/pub/people/mason/patches/reiserfs/quota-2.4/
>> > ( * 50_quota-patch
>> > * dquota_deadlock
>> > * nesting
>> > * reiserfs-quota )
>>
>> For the 2.4.16 kernel, you used the quota patches from my 2.4.16 dir?
>
> Yes.
>
>> The fastest way to rule out filesystem deadlocks is to hook up a serial
>> console and send me the decoded output of sysrq-t.
>
> I'll look into this. A bit of a problem since there are 10 servers and you
> never know which one is going to lockup next time. Do I really need 10 PC's
> to monitor them simultaneously or could it be done more efficiently? I'm no
> kernel hacker, any pointers as to what tools to use etc would be most
> welcome.
For this case, it is enough to configure each kernel to allow a serial
console, wait for a machine to lock up, plug the serial cable into that one
machine, and then do the sysrq-t.
But, test that method first ;-) You can hit sysrq-t at any time without
breaking things.
-chris
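Since "uptime" reportedly still answers over ssh on a locked machine, finding the one server that needs the serial cable can be automated. A minimal sketch, assuming password-less ssh to each host; the function names, the threshold and the host list are hypothetical:

```python
import re
import subprocess

LOAD_RE = re.compile(r"load averages?:\s*([0-9]+(?:[.,][0-9]+)?)")


def one_minute_load(uptime_output):
    """Extract the 1-minute load average from `uptime` output, or None."""
    match = LOAD_RE.search(uptime_output)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


def suspect_hosts(hosts, threshold=100.0, timeout=30):
    """Return (host, load) pairs whose load exceeds `threshold`.

    Hosts that give no usable answer in time are reported with load None,
    since an unreachable machine deserves a look too.
    """
    suspects = []
    for host in hosts:
        try:
            result = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", host, "uptime"],
                capture_output=True, text=True, timeout=timeout,
            )
            load = one_minute_load(result.stdout)
        except (subprocess.TimeoutExpired, OSError):
            load = None
        if load is None or load > threshold:
            suspects.append((host, load))
    return suspects
```

Run periodically from a monitoring machine, suspect_hosts(["server1", "server2"]) would name the host to attach the serial console to.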
Ok, Johan sent along stack traces, and the deadlock works a little like this:
linux-2.4.16 + reiserfs + quota v2
kswapd ->
prune_icache->dispose_list->dquot_drop->commit_dquot->generic_file_write->
mark_inode_dirty->journal_begin-> wait for trans to end
Some process in the transaction is waiting on kswapd to free ram.
So, this will hit any journaled FS that uses quotas and logs inodes
during a write. ext3 doesn't seem to do special things for quota anymore, so
it should be affected too.
The only fix I see is to make sure kswapd doesn't run shrink_icache, and to
have it done via a dedicated daemon instead. Does anyone have a better idea?
-chris
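The circular wait described above can be written down as a small wait-for graph. This is only an illustrative model of the description, not kernel code; the node labels are made-up names:

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    `waits_for` maps each waiter to the single thing it is blocked on.
    """
    for start in waits_for:
        seen = []
        node = start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# Edges taken from the description above; labels are illustrative only.
waits_for = {
    # kswapd reaches journal_begin and waits for the transaction to end
    "kswapd": "running transaction",
    # the transaction ends only when the process inside it finishes
    "running transaction": "transaction holder",
    # that process needs memory before it can finish
    "transaction holder": "free memory",
    # and it relies on kswapd to free that memory
    "free memory": "kswapd",
}

print(find_cycle(waits_for))
```

Any fix has to remove one edge of this cycle, for example by keeping kswapd out of the path that ends in journal_begin.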
Chris Mason wrote:
>
> Ok, Johan sent along stack traces, and the deadlock works a little like this:
>
> linux-2.4.16 + reiserfs + quota v2
>
> kswapd ->
> prune_icache->dispose_list->dquot_drop->commit_dquot->generic_file_write->
> mark_inode_dirty->journal_begin-> wait for trans to end
uh-huh.
> Some process in the transaction is waiting on kswapd to free ram.
This is unfamiliar. Where does a process block on kswapd in this
manner? Not __alloc_pages I think.
> So, this will hit any journaled FS that uses quotas and logs inodes under
> during a write. ext3 doesn't seem to do special things for quota anymore, so
> it should be affected too.
mm.. most of the ext3 damage-avoidance hacks are around writepage().
> The only fix I see is to make sure kswapd doesn't run shrink_icache, and to
> have it done via a dedicated daemon instead. Does anyone have a better idea?
Well, we already need to do something like that to prevent the
abuse of keventd in there. It appears that somebody had a
problem with deadlocks doing the inode writeout in kswapd but
missed the quota problem.
Is it possible for the quota code to just bail out if PF_MEMALLOC
is set? To leave the dquot dirty?
On Friday, December 14, 2001 09:26:42 AM -0800 Andrew Morton
<[email protected]> wrote:
> Chris Mason wrote:
>>
>> Ok, Johan sent along stack traces, and the deadlock works a little like
>> this:
>>
>> linux-2.4.16 + reiserfs + quota v2
>>
>> kswapd ->
>> prune_icache->dispose_list->dquot_drop->commit_dquot->generic_file_write->
>> mark_inode_dirty->journal_begin-> wait for trans to end
>
> uh-huh.
>
>> Some process in the transaction is waiting on kswapd to free ram.
>
> This is unfamiliar. Where does a process block on kswapd in this
> manner? Not __alloc_pages I think.
It kinda blocks on kswapd by default when the process in the transaction
needs to read a block, or allocate one for the commit. Since kswapd is stuck
waiting on the log, eventually a process holding the transaction will try to
allocate something when there are no pages freeable with GFP_NOFS.
It was much worse when we just had GFP_BUFFER before, but the deadlock is
still there.
>
>> So, this will hit any journaled FS that uses quotas and logs inodes under
>> during a write. ext3 doesn't seem to do special things for quota anymore,
>> so it should be affected too.
>
> mm.. most of the ext3 damage-avoidance hacks are around writepage().
sct talked about how the ext3 data logging code allowed quotas to be
consistent after a crash. Perhaps this was just in 2.2.x...
>
>> The only fix I see is to make sure kswapd doesn't run shrink_icache, and to
>> have it done via a dedicated daemon instead. Does anyone have a better
>> idea?
>
> Well, we already need to do something like that to prevent the
> abuse of keventd in there. It appears that somebody had a
> problem with deadlocks doing the inode writeout in kswapd but
> missed the quota problem.
>
> Is it possible for the quota code to just bale out if PF_MEMALLOC
> is set? To leave the dquot dirty?
We could change prune_icache to skip inodes with dirty quota fields. It
already skips dirty inodes, so this isn't a huge change.
I'll try this, and also add kinoded so we can avoid using keventd. I'm wary
of the effects on the VM if kinoded can't keep up though, so I'd like to keep
the shrink_icache call in kswapd if possible.
Johan, expect this to take at least a week before I suggest installing on
production machines. Things are very intertwined here, and these changes
will probably have side effects that need dealing with.
Turning quotas off will solve the problem in the short term.
-chris
On Fri, Dec 14, 2001 at 12:53:00PM -0500, Chris Mason wrote:
> I'll try this, and also add kinoded so we can avoid using keventd. I'm wary
using keventd for that doesn't look too bad to me. Just like we do with
the dirty inode flushing. keventd doesn't do anything 99.9% of the time,
so it sounds a bit wasteful to add yet another daemon that will remain
idle 99% of the time too... :)
Andrea
Andrea Arcangeli wrote:
>
> On Fri, Dec 14, 2001 at 12:53:00PM -0500, Chris Mason wrote:
> > I'll try this, and also add kinoded so we can avoid using keventd. I'm wary
>
> using keventd for that doesn't look too bad to me. Just like we do with
> the dirty inode flushing. keventd doesn't do anything 99.9% of the time,
> so it sounds a bit wasteful to add yet another daemon that will remain
> idle 99% of the time too... :)
Well heck, let's use ksoftirqd then :)
keventd is used for real-time things - deferred interrupt
actions. It should be SCHED_FIFO.
Actually, kupdated almost does what's needed already. I
suspect a wakeup_kupdate() would suffice.
On Friday, December 14, 2001 07:32:17 PM +0100 Andrea Arcangeli
<[email protected]> wrote:
> On Fri, Dec 14, 2001 at 12:53:00PM -0500, Chris Mason wrote:
>> I'll try this, and also add kinoded so we can avoid using keventd. I'm
>> wary
>
> using keventd for that doesn't look too bad to me. Just like we do with
> the dirty inode flushing. keventd doesn't do anything 99.9% of the time,
> so it sounds a bit wasteful to add yet another daemon that will remain
> idle 99% of the time too... :)
I think Andrew's idea was to avoid using it for inode flushing as well.
These are very time consuming tasks (especially if the journal is involved),
making keventd less responsive to the short tasks it was intended to run.
-chris
On Fri, Dec 14, 2001 at 10:57:56AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Fri, Dec 14, 2001 at 12:53:00PM -0500, Chris Mason wrote:
> > > I'll try this, and also add kinoded so we can avoid using keventd. I'm wary
> >
> > using keventd for that doesn't look too bad to me. Just like we do with
> > the dirty inode flushing. keventd doesn't do anything 99.9% of the time,
> > so it sounds a bit wasteful to add yet another daemon that will remain
> > idle 99% of the time too... :)
>
> Well heck, let's use ksoftirqd then :)
:)
ksoftirqd can run quite heavily sometimes (it's needed for an efficient
NAPI for example), it's not a general-purpose kernel thread, and
its work never blocks.
> keventd is used for real-time things - deferred interrupt
> actions. It should be SCHED_FIFO.
The fact is that keventd is _not_ SCHED_FIFO in 2.[245] and in turn
it _cannot_ be used for real-time things.
So if keventd is currently used for real-time things, those real-time
things are malfunctioning right now, regardless of the dirty inode/quota
flushing.
Furthermore, the only point of keventd compared to a tasklet is that
keventd queued tasks can _sleep_, so all the users of keventd should
be expected to block (if they cannot block they should use a tasklet instead,
which has a chance to be faster, per-cpu cache locality etc...).
> Actually, kupdated almost does what's needed already. I
> suspect a wakeup_kupdate() would suffice.
Probably yes, however it would be nice to be able to push inode buffers
to disk while the buffers are getting flushed. So queueing the work to
keventd (or adding a kinoded) still sounds better to me :).
Andrea
Hello,
> Chris Mason wrote:
> > Ok, Johan sent along stack traces, and the deadlock works a little like this:
> >
> > linux-2.4.16 + reiserfs + quota v2
> >
> > kswapd ->
> > prune_icache->dispose_list->dquot_drop->commit_dquot->generic_file_write->
> > mark_inode_dirty->journal_begin-> wait for trans to end
>
> uh-huh.
...
> > The only fix I see is to make sure kswapd doesn't run shrink_icache, and to
> > have it done via a dedicated daemon instead. Does anyone have a better idea?
>
> Well, we already need to do something like that to prevent the
> abuse of keventd in there. It appears that somebody had a
> problem with deadlocks doing the inode writeout in kswapd but
> missed the quota problem.
>
> Is it possible for the quota code to just bale out if PF_MEMALLOC
> is set? To leave the dquot dirty?
Nope. Writing of dquots works the following way:
While a dquot is in use (referenced by an inode) it's written only on explicit sync.
Otherwise it's just marked dirty. When the last reference from an inode to the dquot is
dropped, the dquot is written to disk (if marked dirty). Then it's put on the free
list, from where it can be evicted by prune_dqcache().
So if you don't write the dquot when the last reference is dropped, you
have to solve the following issues:
* you need to ensure the dquot is written at some point
* you need to make prune_dqcache() skip dirty dquots
The first problem can probably be solved using bdflush...
The second one is probably trivial, as dquots use a small amount of memory (~100 bytes
per user), so we don't have to care much about VM balancing.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
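The lifecycle described above, and the two issues that appear once the write on last drop is deferred, can be modelled in a few lines. This is a toy sketch only: the class and method names are hypothetical and do not correspond to the kernel's quota structures.

```python
class Dquot:
    """Toy stand-in for an in-memory quota record (not the kernel struct)."""

    def __init__(self, uid):
        self.uid = uid
        self.refs = 0
        self.dirty = False


class DquotCache:
    def __init__(self, write_on_last_drop=True):
        self.write_on_last_drop = write_on_last_drop
        self.free_list = []
        self.disk_writes = []

    def get(self, dquot):
        if dquot in self.free_list:
            self.free_list.remove(dquot)
        dquot.refs += 1

    def charge(self, dquot):
        dquot.dirty = True  # while referenced, usage changes only mark it dirty

    def write(self, dquot):
        self.disk_writes.append(dquot.uid)
        dquot.dirty = False

    def drop(self, dquot):
        dquot.refs -= 1
        if dquot.refs == 0:
            if dquot.dirty and self.write_on_last_drop:
                self.write(dquot)  # described behaviour: write on last drop
            self.free_list.append(dquot)

    def prune(self):
        """Evict free dquots, skipping dirty ones; return the evicted uids."""
        evicted = [d.uid for d in self.free_list if not d.dirty]
        self.free_list = [d for d in self.free_list if d.dirty]
        return evicted

    def flush(self):
        """Stand-in for a background flusher writing the deferred dquots."""
        for d in self.free_list:
            if d.dirty:
                self.write(d)


cache = DquotCache(write_on_last_drop=False)
q = Dquot(uid=1000)
cache.get(q)
cache.charge(q)
cache.drop(q)
print(cache.prune())  # the dirty dquot is skipped
cache.flush()
print(cache.prune())  # after the flush it can be evicted
```

With write_on_last_drop=False, nothing is written from drop(), so the memory-reclaim path would no longer start a write; the cost is that something else has to call flush().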
> On Friday, December 14, 2001 09:26:42 AM -0800 Andrew Morton
> <[email protected]> wrote:
>
> >> So, this will hit any journaled FS that uses quotas and logs inodes under
> >> during a write. ext3 doesn't seem to do special things for quota anymore,
> >> so it should be affected too.
> >
> > mm.. most of the ext3 damage-avoidance hacks are around writepage().
>
> sct talked about how the ext3 data logging code allowed quotas to be
> consistent after a crash. Perhaps this was just in 2.2.x...
>
> >
> >> The only fix I see is to make sure kswapd doesn't run shrink_icache, and to
> >> have it done via a dedicated daemon instead. Does anyone have a better
> >> idea?
> >
> > Well, we already need to do something like that to prevent the
> > abuse of keventd in there. It appears that somebody had a
> > problem with deadlocks doing the inode writeout in kswapd but
> > missed the quota problem.
> >
> > Is it possible for the quota code to just bale out if PF_MEMALLOC
> > is set? To leave the dquot dirty?
>
> We could change prune_icache to skip inodes with dirty quota fields. It
> already skips dirty inodes, so this isn't a huge change.
Umm. I don't think it's a good idea (at least in the current state): if any
data was written to the inode, the quota was dirtied and probably never written,
so you're going to have *lots* of inodes with dirty quota... So you end up
freeing nothing.
I think that leaving the writing of dquots to bdflush or some other thread
would be a better solution...
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Ok, there is another deadlock possible where kswapd goes into a transaction:
shrink_dcache_memory->prune_dcache->dput->iput->delete_inode()
So, I'm changing my new kinoded to run shrink_dcache_memory as well.
ick.
-chris