2009-01-27 09:49:18

by Ralf Hildebrandt

Subject: sync-Regression in 2.6.28.2?

I recently installed 2.6.28.2 on our postfix/dovecot-based
mailbox server. Previously, 2.6.28 and 2.6.28.1 had been running there
without a hitch.

Now with 2.6.28.2 I had two major lockups: All writes to the users'
Maildirs (on ext4) would stall, the load would rise, "sync" would never
return.

I had to "reboot -f -n" to get the machine back. All hanging processes
were unkillable, even with kill -9.
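
Processes that ignore kill -9 are typically in uninterruptible sleep
(state "D"). As a rough sketch for listing them, assuming a Linux /proc
filesystem, the following reads the state field from each
/proc/<pid>/stat. The function names task_state and blocked_tasks are
made up for illustration and are not part of any existing tool.

```python
import glob


def task_state(stat_line):
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_tasks():
    """Return a sorted list of (pid, comm) for tasks currently in state D."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = task_state(f.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were reading, or the line was odd.
            continue
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

A task that stays in this list across repeated runs is a likely
candidate for the kind of hang described above.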

There was NOTHING in dmesg, nothing in the log either. No indication
of filesystem errors or anything.

Looking at the
http://www.eu.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.28.2
I see changes to

fs: sys_sync fix
fs: sync_sb_inodes fix
fs: remove WB_SYNC_HOLD

Since the mailserver uses sync() extensively (both dovecot and
postfix), maybe that's the culprit?

Note: I'm really just guessing here!
Back to 2.6.28.1 for now.
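
To check whether "sync" still returns without tying up the invoking
shell, one option is to run it as a child process and give up waiting
after a timeout. This is only a sketch: the helper name
sync_returns_within and the timeout values are arbitrary choices, and it
assumes a "sync" binary is on the PATH.

```python
import subprocess


def sync_returns_within(seconds, cmd=("sync",)):
    """Start cmd and report whether it exits within the given time."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        # Deliberately no kill-and-wait here: a child stuck in
        # uninterruptible sleep cannot be reaped, and waiting for it
        # would block this script as well.
        return False


if __name__ == "__main__":
    if sync_returns_within(60):
        print("sync returned")
    else:
        print("sync still running after 60s")
```

A "False" result only says that sync was slow, not why; on a box with a
lot of dirty data a long-running sync can be legitimate.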

--
Ralf Hildebrandt [email protected]
Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155
Geschäftsbereich IT | Abt. Netzwerk Fax. +49 (0)30-450 570-962
Hindenburgdamm 30 | 12200 Berlin


2009-01-27 21:16:30

by Federico Cuello

Subject: Re: sync-Regression in 2.6.28.2?

Ralf Hildebrandt wrote:
> I recently installed 2.6.28.2 on our postfix/dovecot-based
> mailboxserver. Previously, 2.6.28 and 2.6.28.1 have been running there
> without a hitch.
>
> Now with 2.6.28.2 I had two major lockups: All writes to the users'
> Maildirs (on ext4) would stall, the load would rise, "sync" would never
> return.
>
> I had to "reboot -f -n" to get the machine back. All hanging processes
> were unkillable, even with kill -9.
> [...]

The same is happening to me, but I have some logs taken with sysrq.

Here is my vmstat output:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in   cs  us  sy  id   wa
 0  2  99028  46536  26356 1569400    0    0     0     0  1400  437   0   3   0   96
 0  2  99028  46536  26356 1569400    0    0     0     0  1344  355   0   6   0   94
 0  2  99028  46536  26356 1569400    0    0     0     0  1373  387   0   0   0  100
 0  2  99028  46536  26364 1569400    0    0     0    12  1403  384   1   0   0   99
 0  2  99028  46536  26364 1569400    0    0     0     0  1370  378   0   0   0  100
 0  2  99028  46536  26364 1569400    0    0     0     0  1351  346   0   0   0  100
 0  2  99028  46536  26364 1569400    0    0     0     0  1395  412   0   0   0  100
 0  2  99028  46536  26364 1569400    0    0     0     0  1349  332   0   0   0  100
 0  2  99028  46536  26368 1569400    0    0     4     0  1407  387   0   0   0  100

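Notice the iowait ("wa") column: it sits at 94-100% while two processes
stay blocked ("b" = 2) and almost no I/O (bi/bo) gets through.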
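
For anyone wanting to watch for this pattern automatically, here is a
small sketch that parses vmstat rows and flags a run of samples where
tasks are blocked and iowait is high. The function names and the 90%
threshold are my own assumptions; the sample rows are the first three
from the output above.

```python
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat(text):
    """Return one dict per data row; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are exactly 16 non-negative integers.
        if len(fields) == len(VMSTAT_COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def looks_stalled(rows, wa_threshold=90):
    """True if every sample has blocked tasks and iowait >= wa_threshold."""
    return bool(rows) and all(
        r["b"] > 0 and r["wa"] >= wa_threshold for r in rows
    )


SAMPLE = """\
 r  b   swpd   free   buff   cache   si   so    bi    bo    in   cs  us  sy  id   wa
 0  2  99028  46536  26356 1569400    0    0     0     0  1400  437   0   3   0   96
 0  2  99028  46536  26356 1569400    0    0     0     0  1344  355   0   6   0   94
 0  2  99028  46536  26356 1569400    0    0     0     0  1373  387   0   0   0  100
"""

if __name__ == "__main__":
    print(looks_stalled(parse_vmstat(SAMPLE)))
```

This assumes the 16-column layout shown here; newer vmstat versions add
further columns, which this sketch would silently skip.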

I also managed to reproduce it by running an rsync from one partition to
a USB drive. After the lockup I can't read any file from the source
partition, but the other partitions can still be accessed normally.