2008-02-28 22:32:50

by J. Bruce Fields

Subject: Re: kernel 2.6 and simulated flock() with posix locks

On Mon, Feb 25, 2008 at 06:42:35PM +0200, Thanos Chatziathanassiou wrote:
> J. Bruce Fields wrote:
>> On Mon, Feb 25, 2008 at 03:20:29PM +0200, Thanos Chatziathanassiou wrote:
>>
>>> Hi,
>>>
>>> I've been trying to replace kernel 2.4 in a web server mounting its Document Root via NFS with kernel 2.6 and faced a rather disturbing problem.
>>> About 1/2 hour after starting, the server would stop serving requests though it seemed fine.
>>> Earlier 2.6 kernels exhibited the ``do_vfs_lock: VFS is out of sync with lock manager!'' symptom; later ones (after this was changed to a dprintk()) just sat there.
>>> There was no apparent error apart from apache complaining ``[error] server reached MaxClients setting, consider raising the MaxClients setting'', and it was unable to serve any requests.
>>>
>>> This issue does not surface under 2.4, where everything works as expected.
>>> I came across this
>>> (http://blog.notreally.org/articles/2007/12/19/modifying-a-live-linux-kernel/)
>>> where apparently they faced the same problem, but their solution
>>> (which seemed a little crude) resulted in apache spitting ``There are
>>> no available locks'' messages (or roughly this, translated from my
>>> regional settings).
>>>
>>> Is there any solution to this or a way to get 2.4 behavior under 2.6 ?
>>>
>>
>> I'm a little confused--how do you know that the problem you face is the
>> same as the one described on the blog above? Are you re-exporting NFS
>> via Samba?
>>
>> --b.
>>
> Indeed I am. But I am willing to convince you ;) What kind of debug info
> would I need to collect to find out what the problem really is?

Can you give a more detailed explanation of the symptoms? For example,
when you say "the server would stop serving requests", are you referring
to the web server or the nfs server? If you think the problem is that
Apache is hanging on a lock, you should be able to verify that with
strace or /proc/locks or a sysrq-T trace.

--b.


2008-02-29 15:21:43

by Thanos Chatziathanassiou

Subject: Re: kernel 2.6 and simulated flock() with posix locks

J. Bruce Fields wrote:
> On Mon, Feb 25, 2008 at 06:42:35PM +0200, Thanos Chatziathanassiou wrote:
>
>> J. Bruce Fields wrote:
>>
>>> On Mon, Feb 25, 2008 at 03:20:29PM +0200, Thanos Chatziathanassiou wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> I've been trying to replace kernel 2.4 in a web server mounting its Document Root via NFS with kernel 2.6 and faced a rather disturbing problem.
>>>> About 1/2 hour after starting, the server would stop serving requests though it seemed fine.
>>>> Earlier 2.6 kernels exhibited the ``do_vfs_lock: VFS is out of sync with lock manager!'' symptom; later ones (after this was changed to a dprintk()) just sat there.
>>>> There was no apparent error apart from apache complaining ``[error] server reached MaxClients setting, consider raising the MaxClients setting'', and it was unable to serve any requests.
>>>>
>>>> This issue does not surface under 2.4, where everything works as expected.
>>>> I came across this
>>>> (http://blog.notreally.org/articles/2007/12/19/modifying-a-live-linux-kernel/)
>>>> where apparently they faced the same problem, but their solution
>>>> (which seemed a little crude) resulted in apache spitting ``There are
>>>> no available locks'' messages (or roughly this, translated from my
>>>> regional settings).
>>>>
>>>> Is there any solution to this or a way to get 2.4 behavior under 2.6 ?
>>>>
>>>>
>>> I'm a little confused--how do you know that the problem you face is the
>>> same as the one described on the blog above? Are you re-exporting NFS
>>> via Samba?
>>>
>>> --b.
>>>
>>>
>> Indeed I am. But I am willing to convince you ;) What kind of debug info
>> would I need to collect to find out what the problem really is?
>>
>
> Can you give a more detailed explanation of the symptoms? For example,
> when you say "the server would stop serving requests", are you referring
> to the web server or the nfs server?
Sorry if I wasn't clear on this. This particular web server (stock
2.6.16.60) stops serving requests.
The NFS server (2.6.12.6 based) as well as the other (2.4 based) web
servers continue humming along just fine.
> If you think the problem is that
> Apache is hanging on a lock, you should be able to verify that with
> strace or /proc/locks
Well, /proc/locks doesn't tell much:
---snip---
www4:~# cat /proc/locks
1: FLOCK ADVISORY WRITE 2512 08:07:829070 0 EOF
2: POSIX ADVISORY READ 2459 08:07:1284232 0 EOF
3: POSIX ADVISORY WRITE 2454 08:07:829066 0 EOF
---snip---
Process 2459 is
root 2459 0.0 0.0 1552 500 ? S 16:07 0:00 ypbind (slave)
and 2454 is
root 2454 0.0 0.0 1532 448 ? S 16:07 0:00 ypbind (master)
...I couldn't find 2512 (?) in the process table.
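
The lookup done by hand above (taking each pid from /proc/locks and searching the process table for it) can be sketched in Python. This is an illustration only, not something posted in the thread: `parse_proc_locks` and `process_name` are hypothetical helper names, and the field layout is assumed from the three sample lines above.

```python
from collections import namedtuple

# One parsed /proc/locks line; field names are illustrative, not official.
LockEntry = namedtuple(
    "LockEntry", "kind mode access pid device inode start end blocked"
)


def parse_proc_locks(text):
    """Parse /proc/locks-style text into a list of LockEntry tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a lock line
        fields = fields[1:]  # drop the leading "1:" index
        # Assumption: a waiter blocked on another lock is marked with "->".
        blocked = fields[0] == "->"
        if blocked:
            fields = fields[1:]
        if len(fields) < 7:
            continue
        kind, mode, access, pid, dev_inode, start, end = fields[:7]
        major, minor, inode = dev_inode.split(":")
        entries.append(
            LockEntry(kind, mode, access, int(pid), major + ":" + minor,
                      int(inode), start, end, blocked)
        )
    return entries


def process_name(pid):
    """Return the command line of pid from /proc, or None if it is not there."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            raw = fh.read()
    except OSError:
        return None
    return raw.replace(b"\0", b" ").decode(errors="replace").strip() or None


# The three lines quoted above, used here in place of the live file.
SAMPLE = """\
1: FLOCK ADVISORY WRITE 2512 08:07:829070 0 EOF
2: POSIX ADVISORY READ 2459 08:07:1284232 0 EOF
3: POSIX ADVISORY WRITE 2454 08:07:829066 0 EOF
"""

for entry in parse_proc_locks(SAMPLE):
    print(entry.kind, entry.access, entry.pid, entry.inode,
          process_name(entry.pid))
```

On the affected machine one would pass the contents of /proc/locks instead of SAMPLE. A pid with no /proc entry, as with 2512 above, comes back as None.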

However, stracing random httpd processes yields:
---snip---
strace -p 22149
flock(11, LOCK_EX
---snip---

...which is understandably blocking.
Unfortunately, this child never got to write what it was serving at the
time to the access and/or error log, but we can (safely?) assume it was
some mod_perl script that called flock().
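
The strace output only shows that descriptor 11 is being locked, not which file it refers to. On Linux, the symlink /proc/<pid>/fd/<n> names the file behind a descriptor, so a sketch like the following could narrow it down. `fd_target` is a hypothetical helper, and 22149 and 11 are simply the values from the trace above; reading another process's fd directory normally requires being root or the same user.

```python
import os


def fd_target(pid, fd):
    """Return the path behind /proc/<pid>/fd/<fd>, or None if unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/fd/{fd}")
    except OSError:
        # No such process or descriptor, no permission, or no /proc here.
        return None


if __name__ == "__main__":
    # Values taken from the strace session above; prints None elsewhere.
    print(fd_target(22149, 11))
```

If the path turns out to be on the NFS mount, that would at least tell us which file the blocked children are all waiting on.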

Let me know if I can grab anything else.
> or a sysrq-T trace.
>
> --b.
>

