2009-01-23 00:28:06

by Pavel Pisa

Subject: Unexpected cascaded epoll behavior - my mistake or kernel bug

Hello Davide and all others,

I have gotten around to implementing yet another event library and I am
experiencing something strange when a level-triggered epoll_wait() monitors
another level-triggered epoll set. The top-level epoll_wait() returns
information about one pending event. The event points correctly to the
monitored second-level epoll fd, but epoll_wait() on this fd with a zero
or small timeout returns no events, and the top-level epoll reporting is
not reset at this point, so the program enters a busy loop. If there are
actual events, they are processed correctly. More interesting is that the
ill behavior is prevented if I change the number of debug writes to an
unrelated fd, or if the application is slowed down by strace to standard
output. Fortunately, strace to a file does not hide the ill behavior, so
the strange behavior of my code or the kernel can be documented in the
attached straces. Both traces document the cause of the busy loop for the
case where an epoll set with fd #3 is created and that set monitors fd #0.
When a character arrives at fd #0, a new epoll set with fd #4 is created
and fd #3 is added into this new top-level set.

epoll_create(8) = 3
epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
read(0, "E"..., 1) = 1
...
epoll_create(8) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
read(0, "\n"..., 1) = 1
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {}, 8, 0) = 0
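
The failing sequence above reduces to a small standalone program. The
following is only a minimal sketch of the cascade (not my library code,
error checking omitted):

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    struct epoll_event ev, out;
    char c;

    /* inner set: epfd_in monitors stdin (fd 0), level triggered */
    int epfd_in = epoll_create(8);
    ev.events = EPOLLIN;
    ev.data.fd = 0;
    epoll_ctl(epfd_in, EPOLL_CTL_ADD, 0, &ev);

    /* outer set: epfd_out monitors the inner epoll fd */
    int epfd_out = epoll_create(8);
    ev.events = EPOLLIN;
    ev.data.fd = epfd_in;
    epoll_ctl(epfd_out, EPOLL_CTL_ADD, epfd_in, &ev);

    for (;;) {
        /* the outer wait reports the inner epoll fd as readable ... */
        if (epoll_wait(epfd_out, &out, 1, 20000) <= 0)
            break;
        /* ... but the inner wait with zero timeout may report nothing,
         * and the outer readiness is not cleared -> busy loop */
        if (epoll_wait(epfd_in, &out, 1, 0) > 0)
            read(0, &c, 1);
    }
    return 0;
}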

You may ask why this terrible fellow has started writing yet another event
library and why he needs to cascade event handling mechanisms. Let me
defend myself a little. If you do not want to beat me, you can skip the
next section.

==============================================

I am part of a CTU university group working on real-time and industrial
control projects (RT-Ethernet, CAN, CANopen, Profibus, etc.). I also work
for the PiKRON company on other similar projects. We need fast and small
base libraries for Linux, RTEMS and system-less (bare-metal) applications.
Some run on as little as an LPC21xx with 32 kB of RAM. We need generic
container code, XML- or bytecode-driven graphics, support for test builds
of system-less or RTEMS-based applications on Linux, and preparation of
higher control hierarchy applications on Linux.

GLIB would be a good base, but for embedded systems the LGPL license could
be a problem, and 32 kB of RAM and 256 kB of FLASH for the whole
application rule GLIB out in many other cases. Even though the GLIB
implementation is nice from a portability point of view, the efficiency of
the GLIB main loop and the complexity of containers that require plenty of
mallocs are yet another story. So the containers and generic constructs
are covered by the uLUt library http://rtime.felk.cvut.cz/gitweb/ulut.git ,
http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf.

For event processing on Linux or POSIX-based OSes, I thought that libevent
was the right way to go. It is yet another prerequisite for projects
(acceptable), but the problem is that in some cases the libraries should
allow combination with frameworks based on other main loop implementations
(Qt, GTK, Python), and combining libraries that expect different main loop
implementations is a problem. That is why the ul_evpoll library was born.
The initial idea was to define an absolutely minimal compatibility layer
and API which allows applications to run above libevent and, if the need
arises, above GLIB. To test the concept and to allow standalone use of the
library, a simple implementation above the Sys V poll() call has been
included. To test Linux's most advanced feature, simple epoll support has
been added. Yet another target was met when applications became able to
run over the ul_evpoll API under the GLIB main loop. This solves Qt
compatibility as well, because most distributions ship Qt compiled against
the GLIB main loop.
The other important feature is the ability to switch or cascade event
monitoring at runtime when some third party library enforcing a different
main loop implementation is dynamically loaded. This goal has been
achieved as well. The following complex scenarios work now:

* start on ul_evpoll sysvpoll; when a GLIB based library is required,
create a new ul_evpoll base on top of GLIB, cascade the original epoll
set above it and continue to run with the GLIB main loop

* start with ul_evpoll lnxepoll, create a ul_evpoll base on top of GLIB
and continue to run

Events can be inserted by GLIB based applications as GLIB event sources,
or by uLUt based applications as ul_evpoll events into the ul_evpoll
wrapper or over the original sysvpoll or lnxepoll event bases, and
everything ran concurrently without problems in my first test round. The
lnxepoll cascaded over a GLIB base has another nice feature: it corrects
GLIB's poor scaling in the C10K case for those events which are registered
over the ul_evpoll API as ul_evptrig_t into an epoll based ul_evpbase_t.
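
For illustration of the GLIB cascading (this is not the ul_evpoll code
itself, and dispatch_epoll_base() below is a hypothetical stand-in for
whatever drains the cascaded epoll set), the epoll fd can be watched as
an ordinary GLIB source:

#include <glib.h>

/* Called by the GLIB main loop when the epoll fd becomes readable;
 * dispatch_epoll_base() is hypothetical, it stands for the function
 * that drains the cascaded epoll set. */
static gboolean epoll_fd_ready(GIOChannel *ch, GIOCondition cond,
                               gpointer base)
{
    dispatch_epoll_base(g_io_channel_unix_get_fd(ch), base);
    return TRUE;        /* keep the watch installed */
}

static void cascade_epoll_under_glib(int epfd, gpointer base)
{
    GIOChannel *ch = g_io_channel_unix_new(epfd);
    g_io_add_watch(ch, G_IO_IN, epoll_fd_ready, base);
}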

But because of the generic approach of the code, out of curiosity I tried
to cascade one ul_evpoll lnxepoll above another one, and it does not work
as expected. It may be my bug, but the strace result looks strange to me.

==============================================

The ul_evpoll library code is developed in the uLan CVS at the moment:

http://ulan.cvs.sourceforge.net/viewvc/ulan/ulan/host/libs4c/ulevloop/

A preconfigured code snapshot is here:

http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz

The bug triggers on my system when the Linux epoll based event loop is started

_compiled/bin-utils/ul_evpchk lnxepoll

and then the key 'E' followed by Enter is pressed. This creates a new
epoll base and cascades the original event base into it.

Some parts of uLUt, OMK and ul_evpoll are already used by other projects:

http://ulan.sourceforge.net/
http://www.ocera.org/download/components/WP7/index.html
http://frescor.org/
http://rtime.felk.cvut.cz/gitweb/frescor/forb.git

Best wishes, and thanks in advance for any ideas and comments,

Pavel Pisa
e-mail: [email protected]
www: http://cmp.felk.cvut.cz/~pisa


==============================================

The test system is

Linux baree 2.6.27 #1 SMP PREEMPT Mon Oct 13 02:16:44 CEST 2008 x86_64 GNU/Linux

Core 2 Duo based MSI Laptop

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc
arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips : 3989.91
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

==============================================

strace output without additional epoll fd retriggering

execve("/home/pi/repo/ulan/ulan-build/host/_compiled/bin-utils/ul_evpchk", ["/home/pi/repo/ulan/ulan-build/ho"..., "lnxepoll"], [/* 38 vars */]) = 0
brk(0) = 0x60f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f03a5f000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f03a5d000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=149762, ...}) = 0
mmap(NULL, 149762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4f03a38000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libglib-2.0.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0PM\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=794168, ...}) = 0
mmap(NULL, 2891336, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4f03583000
mprotect(0x7f4f03644000, 2093056, PROT_NONE) = 0
mmap(0x7f4f03843000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc0000) = 0x7f4f03843000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\342\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1375536, ...}) = 0
mmap(NULL, 3482232, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4f03230000
mprotect(0x7f4f0337a000, 2093056, PROT_NONE) = 0
mmap(0x7f4f03579000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x149000) = 0x7f4f03579000
mmap(0x7f4f0357e000, 17016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4f0357e000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libpcre.so.3", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\25\0\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=162816, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f03a37000
mmap(NULL, 2258096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4f03008000
mprotect(0x7f4f03030000, 2093056, PROT_NONE) = 0
mmap(0x7f4f0322f000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x27000) = 0x7f4f0322f000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f03a36000
arch_prctl(ARCH_SET_FS, 0x7f4f03a366e0) = 0
mprotect(0x7f4f03579000, 12288, PROT_READ) = 0
munmap(0x7f4f03a38000, 149762) = 0
brk(0) = 0x60f000
brk(0x630000) = 0x630000
epoll_create(8) = 3
epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
read(0, "E"..., 1) = 1
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f03a5c000
epoll_create(8) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
read(0, "\n"..., 1) = 1
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19997) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19996) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19995) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19995) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19994) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19993) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19992) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19992) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19991) = 1
epoll_wait(3, {}, 8, 0) = 0
............
............
BUSY LOOP HERE: epoll_wait on #4 instantly reports an event
pointing to #3, but no event can be read from #3
............
............
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17180) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17180) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17180) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17179) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17179) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 17178) = 1
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
read(0, "q"..., 1) = 1
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 0, {0, {u32=6353264, u64=6353264}}) = 0
close(3) = 0
exit_group(0) = ?


==============================================

strace output with additional epoll fd retriggering after each event

execve("/home/pi/repo/ulan/ulan-build/host/_compiled/bin-utils/ul_evpchk", ["/home/pi/repo/ulan/ulan-build/ho"..., "lnxepoll"], [/* 38 vars */]) = 0
brk(0) = 0x60f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12db6fa000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12db6f8000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=149762, ...}) = 0
mmap(NULL, 149762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f12db6d3000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libglib-2.0.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0PM\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=794168, ...}) = 0
mmap(NULL, 2891336, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f12db21e000
mprotect(0x7f12db2df000, 2093056, PROT_NONE) = 0
mmap(0x7f12db4de000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc0000) = 0x7f12db4de000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\342\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1375536, ...}) = 0
mmap(NULL, 3482232, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f12daecb000
mprotect(0x7f12db015000, 2093056, PROT_NONE) = 0
mmap(0x7f12db214000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x149000) = 0x7f12db214000
mmap(0x7f12db219000, 17016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f12db219000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/usr/lib/libpcre.so.3", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\25\0\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=162816, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12db6d2000
mmap(NULL, 2258096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f12daca3000
mprotect(0x7f12daccb000, 2093056, PROT_NONE) = 0
mmap(0x7f12daeca000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x27000) = 0x7f12daeca000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12db6d1000
arch_prctl(ARCH_SET_FS, 0x7f12db6d16e0) = 0
mprotect(0x7f12db214000, 12288, PROT_READ) = 0
munmap(0x7f12db6d3000, 149762) = 0
brk(0) = 0x60f000
brk(0x630000) = 0x630000
epoll_create(8) = 3
epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
read(0, "E"..., 1) = 1
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12db6f7000
epoll_create(8) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19997) = 1
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
read(0, "\n"..., 1) = 1
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19997) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19995) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19994) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19993) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19991) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19990) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19988) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
............
............
BUSY LOOP HERE: epoll_wait on #4 instantly reports an event
pointing to #3, but no event can be read from #3
............
............
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 18254) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 18253) = 1
epoll_wait(3, {}, 8, 0) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 18251) = 1
epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
read(0, "q"..., 1) = 1
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(4, EPOLL_CTL_DEL, 3, {0, {u32=6353664, u64=6353664}}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 0, {0, {u32=6353264, u64=6353264}}) = 0
close(3) = 0
exit_group(0) = ?
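
The retriggering visible in this trace corresponds to code along these
lines (a sketch of what my library does between dispatch rounds, in the
hope of forcing a fresh evaluation of the inner set; as the trace shows,
the spurious readiness survives it):

#include <sys/epoll.h>

/* Re-register the inner epoll fd in the outer set after each dispatch
 * round; the stale readiness of the inner set is reported anyway. */
static void retrigger_inner_set(int epfd_out, int epfd_in)
{
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = epfd_in;
    epoll_ctl(epfd_out, EPOLL_CTL_DEL, epfd_in, &ev);
    epoll_ctl(epfd_out, EPOLL_CTL_ADD, epfd_in, &ev);
}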



2009-01-23 01:33:21

by Davide Libenzi

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Fri, 23 Jan 2009, Pavel Pisa wrote:

> Hello Davide and all others,
>
> I have gotten around to implementing yet another event library and I am
> experiencing something strange when a level-triggered epoll_wait() monitors
> another level-triggered epoll set. The top-level epoll_wait() returns
> information about one pending event. The event points correctly to the
> monitored second-level epoll fd, but epoll_wait() on this fd with a zero
> or small timeout returns no events, and the top-level epoll reporting is
> not reset at this point, so the program enters a busy loop. If there are
> actual events, they are processed correctly.

I'm ashamed to say, I know. It was reported to me about three months ago;
I coded patches, but I did not have the time to test them carefully.
The problem is that inside the epoll wakeup callback we don't know which
event we got (and we can't call ->poll()). So we add the fd to the
ready-list and we wake up the waiters. The current code simply returns
EPOLLIN (from epoll's poll()) if the ready-list is not empty. The problem
is that the event we received in the epoll callback could have been a
POLLIN while the fd is waiting for POLLOUT. This situation gets sorted out
in epoll_wait(), but inside epoll's poll() only a quick test was used
(ready-list not empty).
The patch was not exactly a one-liner, so I need a few days to double
check it and test it. I'll try to do it this w/end.
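
To make the mismatch concrete, here is a rough userspace sketch of the
scenario (an illustration only, not a test I am shipping): a socketpair
end whose send buffer has been filled is registered for EPOLLOUT, so the
only wakeup it can get is a POLLIN one.

#include <fcntl.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Illustration only: queue a POLLIN-style wakeup for an fd that is
 * registered for EPOLLOUT, then watch the inner/outer sets disagree. */
int epoll_mismatch_demo(void)
{
    int sv[2], epin, epout, outer, inner;
    struct epoll_event ev;
    char buf[4096] = "";

    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    fcntl(sv[0], F_SETFL, O_NONBLOCK);
    while (write(sv[0], buf, sizeof(buf)) > 0)
        ;                       /* fill the send buffer: sv[0] not writable */

    epin = epoll_create(8);
    ev.events = EPOLLOUT;       /* only writability is of interest */
    ev.data.fd = sv[0];
    epoll_ctl(epin, EPOLL_CTL_ADD, sv[0], &ev);

    epout = epoll_create(8);
    ev.events = EPOLLIN;
    ev.data.fd = epin;
    epoll_ctl(epout, EPOLL_CTL_ADD, epin, &ev);

    write(sv[1], "x", 1);       /* POLLIN wakeup queues sv[0] on ready-list */

    outer = epoll_wait(epout, &ev, 1, 0);   /* 1 before the fix */
    inner = epoll_wait(epin, &ev, 1, 0);    /* 0: no EPOLLOUT to report */
    return outer * 10 + inner;
}

Before the fix, the outer wait reports readiness while the inner wait has
nothing to deliver, which is the busy-loop pattern in the traces.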



- Davide

2009-01-24 21:59:19

by Davide Libenzi

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Fri, 23 Jan 2009, Pavel Pisa wrote:

> Hello Davide and all others,
>
> I have gotten around to implementing yet another event library and I am
> experiencing something strange when a level-triggered epoll_wait() monitors
> another level-triggered epoll set. The top-level epoll_wait() returns
> information about one pending event. The event points correctly to the
> monitored second-level epoll fd, but epoll_wait() on this fd with a zero
> or small timeout returns no events, and the top-level epoll reporting is
> not reset at this point, so the program enters a busy loop. If there are
> actual events, they are processed correctly. More interesting is that the
> ill behavior is prevented if I change the number of debug writes to an
> unrelated fd, or if the application is slowed down by strace to standard
> output. Fortunately, strace to a file does not hide the ill behavior, so
> the strange behavior of my code or the kernel can be documented in the
> attached straces. Both traces document the cause of the busy loop for the
> case where an epoll set with fd #3 is created and that set monitors fd #0.
> When a character arrives at fd #0, a new epoll set with fd #4 is created
> and fd #3 is added into this new top-level set.
>
> epoll_create(8) = 3
> epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
> epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
> read(0, "E"..., 1) = 1
> ...
> epoll_create(8) = 4
> epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
> epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
> read(0, "\n"..., 1) = 1
> epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> epoll_wait(3, {}, 8, 0) = 0
> epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> epoll_wait(3, {}, 8, 0) = 0
>
> You may ask why this terrible fellow has started writing yet another event
> library and why he needs to cascade event handling mechanisms. Let me
> defend myself a little. If you do not want to beat me, you can skip the
> next section.

Pavel, can you give the patch below a spin? It's working fine on my two
boxes, but it'd be great if you could confirm it too.
Attached there's also a simple test program to check epoll behavior,
especially the behavior related to recursion.




- Davide


---
fs/eventpoll.c | 495 ++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 296 insertions(+), 199 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===================================================================
--- linux-2.6.mod.orig/fs/eventpoll.c 2009-01-24 10:06:52.000000000 -0800
+++ linux-2.6.mod/fs/eventpoll.c 2009-01-24 13:07:54.000000000 -0800
@@ -1,6 +1,6 @@
/*
- * fs/eventpoll.c (Efficent event polling implementation)
- * Copyright (C) 2001,...,2007 Davide Libenzi
+ * fs/eventpoll.c (Efficient event retrieval implementation)
+ * Copyright (C) 2001,...,2009 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -92,8 +92,8 @@
/* Epoll private bits inside the event mask */
#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET)

-/* Maximum number of poll wake up nests we are allowing */
-#define EP_MAX_POLLWAKE_NESTS 4
+/* Maximum number of nesting allowed inside epoll sets */
+#define EP_MAX_NESTS 4

/* Maximum msec timeout value storeable in a long int */
#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ)
@@ -110,24 +110,21 @@
};

/*
- * Node that is linked into the "wake_task_list" member of the "struct poll_safewake".
- * It is used to keep track on all tasks that are currently inside the wake_up() code
- * to 1) short-circuit the one coming from the same task and same wait queue head
- * (loop) 2) allow a maximum number of epoll descriptors inclusion nesting
- * 3) let go the ones coming from other tasks.
+ * Structure used to track possible nested calls, for too deep recursions
+ * and loop cycles.
*/
-struct wake_task_node {
+struct nested_call_node {
struct list_head llink;
struct task_struct *task;
- wait_queue_head_t *wq;
+ void *cookie;
};

/*
- * This is used to implement the safe poll wake up avoiding to reenter
- * the poll callback from inside wake_up().
+ * This structure is used as collector for nested calls, to check for
+ * maximum recursion depth and loop cycles.
*/
-struct poll_safewake {
- struct list_head wake_task_list;
+struct nested_calls {
+ struct list_head tasks_call_list;
spinlock_t lock;
};

@@ -231,6 +228,12 @@
struct epitem *epi;
};

+/* Used by the ep_send_events() function as callback private data */
+struct ep_send_events_data {
+ int maxevents;
+ struct epoll_event __user *events;
+};
+
/*
* Configuration options available inside /proc/sys/fs/epoll/
*/
@@ -244,8 +247,11 @@
*/
static DEFINE_MUTEX(epmutex);

-/* Safe wake up implementation */
-static struct poll_safewake psw;
+/* Used for safe wake up implementation */
+static struct nested_calls poll_safewake_ncalls;
+
+/* Used to call file's f_op->poll() under the nested calls boundaries */
+static struct nested_calls poll_readywalk_ncalls;

/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
@@ -322,64 +328,95 @@
}

/* Initialize the poll safe wake up structure */
-static void ep_poll_safewake_init(struct poll_safewake *psw)
+static void ep_nested_calls_init(struct nested_calls *ncalls)
{
-
- INIT_LIST_HEAD(&psw->wake_task_list);
- spin_lock_init(&psw->lock);
+ INIT_LIST_HEAD(&ncalls->tasks_call_list);
+ spin_lock_init(&ncalls->lock);
}

-/*
- * Perform a safe wake up of the poll wait list. The problem is that
- * with the new callback'd wake up system, it is possible that the
- * poll callback is reentered from inside the call to wake_up() done
- * on the poll wait queue head. The rule is that we cannot reenter the
- * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
- * and we cannot reenter the same wait queue head at all. This will
- * enable to have a hierarchy of epoll file descriptor of no more than
- * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
- * because this one gets called by the poll callback, that in turn is called
- * from inside a wake_up(), that might be called from irq context.
+/**
+ * ep_call_nested - Perform a bound (possibly) nested call, by checking
+ * that the recursion limit is not exceeded, and that
+ * the same nested call (by the meaning of same cookie) is
+ * no re-entered.
+ *
+ * @ncalls: Pointer to the nested_calls structure to be used for this call.
+ * @max_nests: Maximum number of allowed nesting calls.
+ * @nproc: Nested call core function pointer.
+ * @priv: Opaque data to be passed to the @nproc callback.
+ * @cookie: Cookie to be used to identify this nested call.
+ *
+ * Returns: Returns the code returned by the @nproc callback, or -1 if
+ * the maximum recursion limit has been exceeded.
*/
-static void ep_poll_safewake(struct poll_safewake *psw, wait_queue_head_t *wq)
+static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
+ int (*nproc)(void *, void *, int), void *priv,
+ void *cookie)
{
- int wake_nests = 0;
+ int error, call_nests = 0;
unsigned long flags;
struct task_struct *this_task = current;
- struct list_head *lsthead = &psw->wake_task_list;
- struct wake_task_node *tncur;
- struct wake_task_node tnode;
+ struct list_head *lsthead = &ncalls->tasks_call_list;
+ struct nested_call_node *tncur;
+ struct nested_call_node tnode;

- spin_lock_irqsave(&psw->lock, flags);
+ spin_lock_irqsave(&ncalls->lock, flags);

- /* Try to see if the current task is already inside this wakeup call */
+ /*
+ * Try to see if the current task is already inside this wakeup call.
+ * We use a list here, since the population inside this set is always
+ * very much limited.
+ */
list_for_each_entry(tncur, lsthead, llink) {
-
- if (tncur->wq == wq ||
- (tncur->task == this_task && ++wake_nests > EP_MAX_POLLWAKE_NESTS)) {
+ if (tncur->task == this_task &&
+ (tncur->cookie == cookie || ++call_nests > max_nests)) {
/*
* Ops ... loop detected or maximum nest level reached.
* We abort this wake by breaking the cycle itself.
*/
- spin_unlock_irqrestore(&psw->lock, flags);
- return;
+ spin_unlock_irqrestore(&ncalls->lock, flags);
+ return -1;
}
}

/* Add the current task to the list */
tnode.task = this_task;
- tnode.wq = wq;
+ tnode.cookie = cookie;
list_add(&tnode.llink, lsthead);

- spin_unlock_irqrestore(&psw->lock, flags);
+ spin_unlock_irqrestore(&ncalls->lock, flags);

- /* Do really wake up now */
- wake_up_nested(wq, 1 + wake_nests);
+ /* Call the nested function */
+ error = (*nproc)(priv, cookie, call_nests);

/* Remove the current task from the list */
- spin_lock_irqsave(&psw->lock, flags);
+ spin_lock_irqsave(&ncalls->lock, flags);
list_del(&tnode.llink);
- spin_unlock_irqrestore(&psw->lock, flags);
+ spin_unlock_irqrestore(&ncalls->lock, flags);
+
+ return error;
+}
+
+static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
+{
+ wake_up_nested((wait_queue_head_t *) cookie, 1 + call_nests);
+ return 0;
+}
+
+/*
+ * Perform a safe wake up of the poll wait list. The problem is that
+ * with the new callback'd wake up system, it is possible that the
+ * poll callback is reentered from inside the call to wake_up() done
+ * on the poll wait queue head. The rule is that we cannot reenter the
+ * wake up code from the same task more than EP_MAX_NESTS times,
+ * and we cannot reenter the same wait queue head at all. This will
+ * enable to have a hierarchy of epoll file descriptor of no more than
+ * EP_MAX_NESTS deep.
+ */
+static void ep_poll_safewake(wait_queue_head_t *wq)
+{
+ ep_call_nested(&poll_safewake_ncalls, EP_MAX_NESTS,
+ ep_poll_wakeup_proc, NULL, wq);
}

/*
@@ -407,6 +444,104 @@
}
}

+/**
+ * ep_scan_ready_list - Scans the ready list in a way that makes possible for
+ * the scan code, to call f_op->poll(). Also allows for
+ * O(NumReady) performance.
+ *
+ * @ep: Pointer to the epoll private data structure.
+ * @sproc: Pointer to the scan callback.
+ * @priv: Private opaque data passed to the @sproc callback.
+ *
+ * Returns: The same integer error code returned by the @sproc callback.
+ */
+static int ep_scan_ready_list(struct eventpoll *ep,
+ int (*sproc)(struct eventpoll *,
+ struct list_head *, void *),
+ void *priv)
+{
+ int error, pwake = 0;
+ unsigned long flags;
+ struct epitem *epi, *nepi;
+ struct list_head txlist;
+
+ INIT_LIST_HEAD(&txlist);
+
+ /*
+ * We need to lock this because we could be hit by
+ * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
+ */
+ mutex_lock(&ep->mtx);
+
+ /*
+ * Steal the ready list, and re-init the original one to the
+ * empty list. Also, set ep->ovflist to NULL so that events
+ * happening while looping w/out locks, are not lost. We cannot
+ * have the poll callback to queue directly on ep->rdllist,
+ * because we want the "sproc" callback to be able to do it
+ * in a lockless way.
+ */
+ spin_lock_irqsave(&ep->lock, flags);
+ list_splice(&ep->rdllist, &txlist);
+ INIT_LIST_HEAD(&ep->rdllist);
+ ep->ovflist = NULL;
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ /*
+ * Now call the callback function.
+ */
+ error = (*sproc)(ep, &txlist, priv);
+
+ spin_lock_irqsave(&ep->lock, flags);
+ /*
+ * During the time we spent inside the "sproc" callback, some
+ * other events might have been queued by the poll callback.
+ * We re-insert them inside the main ready-list here.
+ */
+ for (nepi = ep->ovflist; (epi = nepi) != NULL;
+ nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+ /*
+ * We need to check if the item is already in the list.
+ * During the "sproc" callback execution time, items are
+ * queued into ->ovflist but the "txlist" might already
+ * contain them, and the list_splice() below takes care of them.
+ */
+ if (!ep_is_linked(&epi->rdllink))
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ }
+ /*
+ * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
+ * releasing the lock, events will be queued in the normal way inside
+ * ep->rdllist.
+ */
+ ep->ovflist = EP_UNACTIVE_PTR;
+
+ /*
+ * Quickly re-inject items left on "txlist".
+ */
+ list_splice(&txlist, &ep->rdllist);
+
+ if (!list_empty(&ep->rdllist)) {
+ /*
+ * Wake up (if active) both the eventpoll wait list and the ->poll()
+ * wait list (delayed after we release the lock).
+ */
+ if (waitqueue_active(&ep->wq))
+ wake_up_locked(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ pwake++;
+ }
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ mutex_unlock(&ep->mtx);
+
+ /* We have to call this outside the lock */
+ if (pwake)
+ ep_poll_safewake(&ep->poll_wait);
+
+ return error;
+}
+
/*
* Removes a "struct epitem" from the eventpoll RB tree and deallocates
* all the associated resources. Must be called with "mtx" held.
@@ -457,7 +592,7 @@

/* We need to release all tasks waiting for these file */
if (waitqueue_active(&ep->poll_wait))
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

/*
* We need to lock this because we could be hit by
@@ -507,22 +642,44 @@
return 0;
}

+static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
+{
+ struct epitem *epi, *tmp;
+
+ list_for_each_entry_safe(epi, tmp, head, rdllink) {
+ if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+ epi->event.events)
+ return POLLIN | POLLRDNORM;
+ else
+ list_del_init(&epi->rdllink);
+ }
+
+ return 0;
+}
+
+static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
+{
+ return ep_scan_ready_list(priv, ep_read_events_proc, NULL);
+}
+
static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
{
- unsigned int pollflags = 0;
- unsigned long flags;
+ int pollflags;
struct eventpoll *ep = file->private_data;

/* Insert inside our poll wait queue */
poll_wait(file, &ep->poll_wait, wait);

- /* Check our condition */
- spin_lock_irqsave(&ep->lock, flags);
- if (!list_empty(&ep->rdllist))
- pollflags = POLLIN | POLLRDNORM;
- spin_unlock_irqrestore(&ep->lock, flags);
+ /*
+ * Proceed to find out if wanted events are really available inside
+ * the ready list. This need to be done under ep_call_nested()
+ * supervision, since the call to f_op->poll() done on listed files
+ * could re-enter here.
+ */
+ pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS,
+ ep_poll_readyevents_proc, ep, ep);

- return pollflags;
+ return pollflags != -1 ? pollflags: 0;
}

/* File callbacks that implement the eventpoll file behaviour */
@@ -703,7 +860,7 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

return 1;
}
@@ -830,7 +987,7 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_insert(%p, %p, %d)\n",
current, ep, tfile, fd));
@@ -904,137 +1061,74 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

return 0;
}

-static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents)
+static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
{
- int eventcnt, error = -EFAULT, pwake = 0;
- unsigned int revents;
- unsigned long flags;
- struct epitem *epi, *nepi;
- struct list_head txlist;
-
- INIT_LIST_HEAD(&txlist);
-
- /*
- * We need to lock this because we could be hit by
- * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
- */
- mutex_lock(&ep->mtx);
-
- /*
- * Steal the ready list, and re-init the original one to the
- * empty list. Also, set ep->ovflist to NULL so that events
- * happening while looping w/out locks, are not lost. We cannot
- * have the poll callback to queue directly on ep->rdllist,
- * because we are doing it in the loop below, in a lockless way.
- */
- spin_lock_irqsave(&ep->lock, flags);
- list_splice(&ep->rdllist, &txlist);
- INIT_LIST_HEAD(&ep->rdllist);
- ep->ovflist = NULL;
- spin_unlock_irqrestore(&ep->lock, flags);
-
- /*
- * We can loop without lock because this is a task private list.
- * We just splice'd out the ep->rdllist in ep_collect_ready_items().
- * Items cannot vanish during the loop because we are holding "mtx".
- */
- for (eventcnt = 0; !list_empty(&txlist) && eventcnt < maxevents;) {
- epi = list_first_entry(&txlist, struct epitem, rdllink);
-
- list_del_init(&epi->rdllink);
-
- /*
- * Get the ready file event set. We can safely use the file
- * because we are holding the "mtx" and this will guarantee
- * that both the file and the item will not vanish.
- */
- revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
- revents &= epi->event.events;
-
- /*
- * Is the event mask intersect the caller-requested one,
- * deliver the event to userspace. Again, we are holding
- * "mtx", so no operations coming from userspace can change
- * the item.
- */
- if (revents) {
- if (__put_user(revents,
- &events[eventcnt].events) ||
- __put_user(epi->event.data,
- &events[eventcnt].data))
- goto errxit;
- if (epi->event.events & EPOLLONESHOT)
- epi->event.events &= EP_PRIVATE_BITS;
- eventcnt++;
- }
- /*
- * At this point, noone can insert into ep->rdllist besides
- * us. The epoll_ctl() callers are locked out by us holding
- * "mtx" and the poll callback will queue them in ep->ovflist.
- */
- if (!(epi->event.events & EPOLLET) &&
- (revents & epi->event.events))
- list_add_tail(&epi->rdllink, &ep->rdllist);
- }
- error = 0;
-
-errxit:
-
- spin_lock_irqsave(&ep->lock, flags);
- /*
- * During the time we spent in the loop above, some other events
- * might have been queued by the poll callback. We re-insert them
- * inside the main ready-list here.
- */
- for (nepi = ep->ovflist; (epi = nepi) != NULL;
- nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
- /*
- * If the above loop quit with errors, the epoll item might still
- * be linked to "txlist", and the list_splice() done below will
- * take care of those cases.
- */
- if (!ep_is_linked(&epi->rdllink))
- list_add_tail(&epi->rdllink, &ep->rdllist);
- }
- /*
- * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
- * releasing the lock, events will be queued in the normal way inside
- * ep->rdllist.
- */
- ep->ovflist = EP_UNACTIVE_PTR;
+ struct ep_send_events_data *esed = priv;
+ int eventcnt;
+ unsigned int revents;
+ struct epitem *epi;
+ struct epoll_event __user *uevent;

- /*
- * In case of error in the event-send loop, or in case the number of
- * ready events exceeds the userspace limit, we need to splice the
- * "txlist" back inside ep->rdllist.
- */
- list_splice(&txlist, &ep->rdllist);
+ /*
+ * We can loop without lock because we are passed a task private list.
+ * Items cannot vanish during the loop because ep_scan_ready_list() is
+ * holding "mtx" during this call.
+ */
+ for (eventcnt = 0, uevent = esed->events;
+ !list_empty(head) && eventcnt < esed->maxevents;) {
+ epi = list_first_entry(head, struct epitem, rdllink);
+
+ list_del_init(head);
+
+ revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+ epi->event.events;
+
+ /*
+ * If the event mask intersect the caller-requested one,
+ * deliver the event to userspace. Again, ep_scan_ready_list()
+ * is holding "mtx", so no operations coming from userspace
+ * can change the item.
+ */
+ if (revents) {
+ if (__put_user(revents, &uevent->events) ||
+ __put_user(epi->event.data, &uevent->data))
+ return eventcnt ? eventcnt: -EFAULT;
+ eventcnt++;
+ uevent++;
+ if (epi->event.events & EPOLLONESHOT)
+ epi->event.events &= EP_PRIVATE_BITS;
+ else if (!(epi->event.events & EPOLLET))
+ /*
+ * If this file has been added with Level Trigger
+ * mode, we need to insert back inside the ready
+ * list, so that the next call to epoll_wait()
+ * will check again the events availability.
+ * At this point, noone can insert into ep->rdllist
+ * besides us. The epoll_ctl() callers are locked
+ * out by ep_scan_ready_list() holding "mtx" and
+ * the poll callback will queue them in ep->ovflist.
+ */
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ }
+ }

- if (!list_empty(&ep->rdllist)) {
- /*
- * Wake up (if active) both the eventpoll wait list and the ->poll()
- * wait list (delayed after we release the lock).
- */
- if (waitqueue_active(&ep->wq))
- wake_up_locked(&ep->wq);
- if (waitqueue_active(&ep->poll_wait))
- pwake++;
- }
- spin_unlock_irqrestore(&ep->lock, flags);
+ return eventcnt;
+}

- mutex_unlock(&ep->mtx);
+static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events,
+ int maxevents)
+{
+ struct ep_send_events_data esed;

- /* We have to call this outside the lock */
- if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ esed.maxevents = maxevents;
+ esed.events = events;

- return eventcnt == 0 ? error: eventcnt;
+ return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
}

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
@@ -1046,7 +1140,7 @@
wait_queue_t wait;

/*
- * Calculate the timeout by checking for the "infinite" value ( -1 )
+ * Calculate the timeout by checking for the "infinite" value (-1)
* and the overflow condition. The passed timeout is in milliseconds,
* that why (t * HZ) / 1000.
*/
@@ -1112,42 +1206,42 @@
*/
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
- int error, fd = -1;
- struct eventpoll *ep;
+ int error;
+ struct eventpoll *ep = NULL;

/* Check the EPOLL_* constant for consistency. */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

- if (flags & ~EPOLL_CLOEXEC)
- return -EINVAL;
-
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
current, flags));

+ error = -EINVAL;
+ if (flags & ~EPOLL_CLOEXEC)
+ goto error_return;
+
/*
- * Create the internal data structure ( "struct eventpoll" ).
+ * Create the internal data structure ("struct eventpoll").
*/
error = ep_alloc(&ep);
- if (error < 0) {
- fd = error;
+ if (error < 0)
goto error_return;
- }

/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
- fd = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
- flags & O_CLOEXEC);
- if (fd < 0)
+ error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+ flags & O_CLOEXEC);
+ if (error < 0)
ep_free(ep);
- atomic_inc(&ep->user->epoll_devs);
+ else
+ atomic_inc(&ep->user->epoll_devs);

error_return:
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d) = %d\n",
- current, flags, fd));
+ current, flags, error));

- return fd;
+ return error;
}

SYSCALL_DEFINE1(epoll_create, int, size)
@@ -1371,7 +1465,10 @@
EP_ITEM_COST;

/* Initialize the structure used to perform safe poll wait head wake ups */
- ep_poll_safewake_init(&psw);
+ ep_nested_calls_init(&poll_safewake_ncalls);
+
+ /* Initialize the structure used to perform file's f_op->poll() calls */
+ ep_nested_calls_init(&poll_readywalk_ncalls);

/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),


Attachments:
epoll_test.c (7.38 kB)

2009-01-25 21:02:40

by Davide Libenzi

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Sat, 24 Jan 2009, Davide Libenzi wrote:

> On Fri, 23 Jan 2009, Pavel Pisa wrote:
>
> > Hello Davide and all others,
> >
> > I have gotten around to implementing yet another event library and I am
> > experiencing something strange when a level-triggered epoll_wait() monitors
> > another level-triggered epoll set. The top-level epoll_wait() returns
> > information about one pending event. The event points correctly to the
> > monitored second-level epoll fd, but epoll_wait() on this fd with a zero
> > or small timeout returns no events, and the top-level epoll reporting is
> > not reset at this point, so the program enters a busy loop. If there are
> > actual events, they are processed correctly. More interesting is that the
> > ill behavior is prevented if I change the number of debug writes to an
> > unrelated fd, or if the application is slowed down by strace to standard
> > output. Fortunately, strace to a file does not hide the ill behavior, so
> > the strange behavior of my code or the kernel can be documented in the
> > attached straces. Both traces document the cause of the busy loop for the
> > case where an epoll set with fd #3 is created and that set monitors fd #0.
> > When a character arrives at fd #0, a new epoll set with fd #4 is created
> > and fd #3 is added into this new top-level set.
> >
> > epoll_create(8) = 3
> > epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
> > read(0, "E"..., 1) = 1
> > ...
> > epoll_create(8) = 4
> > epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
> > read(0, "\n"..., 1) = 1
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
> >
> > You may ask why this terrible fellow has started writing yet another event
> > library and why he needs to cascade event handling mechanisms. Let me
> > defend myself a little. If you do not want to beat me, you can skip the
> > next section.
>
> Pavel, can you give the patch below a spin? It's working fine on my two
> boxes, but it'd be great if you could confirm it too.
> Attached there's also a simple test program to check epoll behavior,
> especially the behavior related to recursion.

Please use the attached epoll_test.c, since the one I sent you before
wasn't actually triggering the loop condition it was meant to check.



- Davide


Attachments:
epoll_test.c (8.30 kB)

2009-01-26 01:29:18

by Pavel Pisa

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Saturday 24 January 2009 22:58:45 Davide Libenzi wrote:
> On Fri, 23 Jan 2009, Pavel Pisa wrote:
> > Hello Davide and all others,
> >
> > I have gotten around to implementing yet another event library and I am
> > experiencing something strange when a level-triggered epoll_wait()
> > monitors another level-triggered epoll set. The top-level epoll_wait()
> > returns information about one pending event. The event points correctly
> > to the monitored second-level epoll fd, but epoll_wait() on this fd with
> > a zero or small timeout returns no events, and the top-level epoll
> > reporting is not reset at this point, so the program enters a busy loop.
> > If there are actual events, they are processed correctly. More
> > interesting is that the ill behavior is prevented if I change the number
> > of debug writes to an unrelated fd, or if the application is slowed down
> > by strace to standard output. Fortunately, strace to a file does not
> > hide the ill behavior, so the strange behavior of my code or the kernel
> > can be documented in the attached straces. Both traces document the
> > cause of the busy loop for the case where an epoll set with fd #3 is
> > created and that set monitors fd #0. When a character arrives at fd #0,
> > a new epoll set with fd #4 is created and fd #3 is added into this new
> > top-level set.
> >
> > epoll_create(8) = 3
> > epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
> > read(0, "E"..., 1) = 1
> > ...
> > epoll_create(8) = 4
> > epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
> > read(0, "\n"..., 1) = 1
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
> >
> > You may ask why this terrible fellow has started writing yet another
> > event library and why he needs to cascade event handling mechanisms.
> > Let me defend myself a little. If you do not want to beat me, you can
> > skip the next section.
>
> Pavel, can you give the patch below a spin? It's working fine on my two
> boxes, but it'd be great if you could confirm it too.
> Attached there's also a simple test program to check epoll behavior,
> especially the behavior related to recursion.
>

Hello Davide,

thanks for the fast reply and your effort.

I have tested your patch on the 2.6.28.2 kernel, both patched and
unpatched. The outputs are attached. The patched kernel passes all your
tests.

But I have bad news. My library code does not fall into the busy loop
after your patch, but my FIFO tests do not work even in the single level
epoll scenario. The strace output is attached as well.
I will try to do more testing tomorrow. I returned from a weekend trip
this evening and have not had much time for deeper analysis.

But it looks like level sensitive write events are not triggering for
epoll at all. The pipe is not filled by any character, and the specified
timeout triggers instead, with the following messages as the failure
results:
@ pipe #X evptrig wr timeout
@ pipe #X evptrig rd timeout
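
The expectation is trivial to state: the write end of an empty pipe
should report EPOLLOUT immediately. A minimal standalone check (a
sketch, independent of my library) would be:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    int pfd[2], epfd;
    struct epoll_event ev;

    pipe(pfd);          /* empty pipe: the write end should be writable */
    epfd = epoll_create(8);
    ev.events = EPOLLOUT;
    ev.data.fd = pfd[1];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[1], &ev);

    /* should print 1 immediately; my FIFO tests on the patched kernel
     * hit timeouts in this kind of situation instead */
    printf("%d\n", epoll_wait(epfd, &ev, 1, 1000));
    return 0;
}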

If you want to test the code on your box, download the library

http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz

The simple "make" in the top directory should work.

Then you can check the correct behavior with the default sysvpoll event
monitoring. Run the program "_compiled/bin-utils/ul_evpchk".
If you press "p" once or multiple times and then Enter, you should see
that fd 0 receives your characters:

<5>@ fd #0 evptrig 1 count 1
'p'

Then there should be 10 burst series of 50 pipe writes and reads.
Each series is separated by a short timed delay.

The output of the last few reads and writes looks like this:

<5>@ pipe #3 rd evptrig 1 count wr 499 rd 498
<5>@ pipe #4 wr evptrig 2 count wr 500 rd 499
<5>***** ul_evpoll_dispatch returned 1 *****
<5>***** ul_evpoll_dispatch called *****
<5>@ pipe #3 rd evptrig 1 count wr 500 rd 499
<5>@ pipe #3 rd final count reached
<4>@ pipe rd #3 wr #4 destroyed
<3>@ pipe active count reached 0 in 20131

The important part is the last line, which confirms that all FIFOs
have reached the final count and that there are 0 active pipes now.
The last number reports the time in ms spent before all pipes
finished. There are some distracting prints about timers 1, 2, 3
which test timer support in parallel. If you would like to get rid
of these, you can comment out or disable the timer lines in

ulevloop/ul_evpchk.c

- if(1) {
+ if(0) {
add_my_timer(2000, 0);
add_my_timer(3000, 20);
add_my_timer(1400, 10);
}

You can then check the code with the epoll event monitoring mechanism:

_compiled/bin-utils/ul_evpchk lnxepoll

It works the same way (correctly) without the patch.
You can try the GLIB support as well, unsurprisingly via the parameter
"glib".

You can start with "lnxepoll" again and then press "C"+Enter to cascade
epoll into a GLIB main loop. Again, this worked correctly in the past.
If you cascade by "E"+Enter, cascading of epoll into epoll is requested,
and that failed in the default setup. If the output was redirected, it
worked correctly in some situations.

Other tests can be invoked by further letters:
"t" .. run 1000 active FDs in parallel
"T" .. run 10000 active FDs in parallel
"l" .. create 1000 inactive FDs as load on epoll
"L" .. create 8000 inactive FDs

All of these worked with all mechanisms before patching, even with
cascading, except for the epoll-on-epoll cascade.
These huge tests switch to a low debug level, so only errors
and final reports are seen after the test letter.
The GLIB based test for 10000 FDs is horribly long, so I have
not found the patience to run it in full.

Here are some test results for the different tests and mechanisms.
Each test is invoked by the initial command line. The first number is the
number of active FDs (500 events scheduled for each), the second the
number of inactive FDs, the third is the mechanism ("cascaded" means
epoll cascaded on GLIB), and the last number is the time in ms.
All tests have been run on a Core 2 Duo T7300 @ 2.00GHz.

t

1000 0 sysvpoll 1369
1000 0 epoll 1570
1000 0 glib 3046
1000 0 cascaded 1783

tttttttttt

10000 0 sysvpoll 17774
10000 0 epoll 20750
10000 0 glib 445341
10000 0 cascaded 23098

tltltltltltltltltltl

10000 10000 sysvpoll 18924
10000 10000 epoll 20748
10000 10000 glib ?
10000 10000 cascaded 23370

llltllltll

2000 8000 sysvpoll 4014
2000 8000 epoll 3205
2000 8000 glib 87357
2000 8000 cascaded 3718


Thanks again for testing, and if you have some time and interest,
you can try my kernel torture test on your boxes,

Pavel



Attachments:
epoll_test-2.6.28-orig.out (3.40 kB)
epoll_test-2.6.28-patched.out (3.36 kB)
ul_evpchk-lnxepoll-2.6.28.2-patched.strace (10.75 kB)

2009-01-26 19:55:07

by Davide Libenzi

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Mon, 26 Jan 2009, Pavel Pisa wrote:

> Hello Davide,
>
> thanks for the fast reply and your effort.
>
> I have tested your patch on the 2.6.28.2 kernel, both patched and
> unpatched. The outputs are attached. The patched kernel passes all your
> tests.
>
> But I have bad news. My library code does not fall into the busy loop
> after your patch, but my FIFO tests do not work even in the single level
> epoll scenario. The strace output is attached as well.
> I will try to do more testing tomorrow. I returned from a weekend trip
> this evening and have not had much time for deeper analysis.
>
> But it looks like level sensitive write events are not triggering for
> epoll at all. The pipe is not filled by any character, and the specified
> timeout triggers instead, with the following messages as the failure
> results:
> @ pipe #X evptrig wr timeout
> @ pipe #X evptrig rd timeout
>
> If you want to test the code on your box, download the library
>
> http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz
>
> A simple "make" in the top directory should work.

It'd be really great if you could spare me having to dig into a few
thousand lines of userspace library code, and isolate a little better
what the expected result is, and what the error result is.



- Davide

2009-01-26 20:36:18

by Davide Libenzi

Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Mon, 26 Jan 2009, Davide Libenzi wrote:

> On Mon, 26 Jan 2009, Pavel Pisa wrote:
>
> > Hello Davide,
> >
> > thanks for the fast reply and your effort.
> >
> > I have tested your patch on the 2.6.28.2 kernel, both patched and
> > unpatched. The outputs are attached. The patched kernel passes all
> > your tests.
> >
> > But I have bad news. My library code does not fall into the busy loop
> > after your patch, but my FIFO tests do not work even in the single
> > level epoll scenario. The strace output is attached as well.
> > I will try to do more testing tomorrow. I returned from a weekend
> > trip this evening and have not had much time for deeper analysis.
> >
> > But it looks like level sensitive write events are not triggering for
> > epoll at all. The pipe is not filled by any character, and the
> > specified timeout triggers instead, with the following messages as
> > the failure results:
> > @ pipe #X evptrig wr timeout
> > @ pipe #X evptrig rd timeout
> >
> > If you want to test the code on your box, download the library
> >
> > http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz
> >
> > A simple "make" in the top directory should work.
>
> It'd be really great if you could spare me having to dig into a few
> thousand lines of userspace library code, and isolate a little better
> what the expected result is, and what the error result is.

Never mind, I looked into it myself, and I'm able to replicate it with the
simple attached test program. Looking into it ...



- Davide


Attachments:
epoll_pipe_pingpong.c (2.69 kB)
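
The attached reproducer is not inlined in the archive, so the following is
only a hypothetical minimal sketch along the same lines (a reconstruction
for illustration, not Davide's actual epoll_pipe_pingpong.c): an inner
level-triggered epoll set watches one end of a pipe, the inner epoll fd is
registered in an outer epoll set, and the outer set should stop reporting
readiness once the pipe is drained.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    int pfd[2], inner, outer, i, n;
    char c;
    struct epoll_event ev, evs[8];

    if (pipe(pfd) < 0 || (inner = epoll_create(8)) < 0 ||
        (outer = epoll_create(8)) < 0) {
        perror("setup");
        return 1;
    }

    /* Level-triggered read interest on the pipe, inside the inner set */
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        return perror("ctl inner"), 1;

    /* The inner epoll fd itself is monitored by the outer set */
    ev.data.fd = inner;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0)
        return perror("ctl outer"), 1;

    /* Make the pipe readable, confirm the outer set sees it ... */
    write(pfd[1], "x", 1);
    n = epoll_wait(outer, evs, 8, 0);
    printf("outer before drain: %d (expect 1)\n", n);

    /* ... then drain the pipe, so nothing is ready anymore */
    read(pfd[0], &c, 1);

    /*
     * A correct kernel now reports the outer set idle as well.  On the
     * affected kernels the outer epoll_wait() kept returning 1 while
     * the inner one returned 0, which is the busy loop in the straces.
     */
    for (i = 0; i < 3; i++) {
        n = epoll_wait(outer, evs, 8, 0);
        printf("outer after drain: %d (expect 0)\n", n);
    }
    return 0;
}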

2009-01-26 21:04:54

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Mon, 26 Jan 2009, Davide Libenzi wrote:

> On Mon, 26 Jan 2009, Davide Libenzi wrote:
>
> > On Mon, 26 Jan 2009, Pavel Pisa wrote:
> >
> > > Hello Davide,
> > >
> > > thanks for the fast reply and effort.
> > >
> > > I have tested your patch with patched and unpatched 2.6.28.2 kernels.
> > > The outputs are attached. The patched kernel passes all your tests.
> > >
> > > But I have bad news. My library code does not fall into the busy loop
> > > after your patching, but my FIFO tests do not work even in the
> > > single-level epoll scenario. The strace output is attached as well.
> > > I will do more testing tomorrow. I returned from a weekend trip
> > > this evening and do not have much time for deeper analysis.
> > >
> > > But it looks like level-sensitive write events are not triggering
> > > for epoll at all. The pipe is not filled by any character, and the
> > > specified timeout expires with the following messages as failure results:
> > > @ pipe #X evptrig wr timeout
> > > @ pipe #X evptrig rd timeout
> > >
> > > If you want to test the code on your box, download the library
> > >
> > > http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz
> > >
> > > The simple "make" in the top directory should work.
> >
> > > It'd be really great if you could spare me having to dig into a few
> > > thousand lines of userspace library code, and isolate a little
> > > better what the expected result is, and what the error result is.
>
> Never mind, I looked into it myself, and I'm able to replicate it with the
> simple attached test program. Looking into it ...

Alright, found it. Both my test program and yours work fine now.



- Davide


---
fs/eventpoll.c | 510 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 304 insertions(+), 206 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===================================================================
--- linux-2.6.mod.orig/fs/eventpoll.c 2009-01-24 10:06:52.000000000 -0800
+++ linux-2.6.mod/fs/eventpoll.c 2009-01-26 12:56:12.000000000 -0800
@@ -1,6 +1,6 @@
/*
- * fs/eventpoll.c (Efficent event polling implementation)
- * Copyright (C) 2001,...,2007 Davide Libenzi
+ * fs/eventpoll.c (Efficient event retrieval implementation)
+ * Copyright (C) 2001,...,2009 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -92,8 +92,8 @@
/* Epoll private bits inside the event mask */
#define EP_PRIVATE_BITS (EPOLLONESHOT | EPOLLET)

-/* Maximum number of poll wake up nests we are allowing */
-#define EP_MAX_POLLWAKE_NESTS 4
+/* Maximum number of nesting allowed inside epoll sets */
+#define EP_MAX_NESTS 4

/* Maximum msec timeout value storeable in a long int */
#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ)
@@ -110,24 +110,21 @@
};

/*
- * Node that is linked into the "wake_task_list" member of the "struct poll_safewake".
- * It is used to keep track on all tasks that are currently inside the wake_up() code
- * to 1) short-circuit the one coming from the same task and same wait queue head
- * (loop) 2) allow a maximum number of epoll descriptors inclusion nesting
- * 3) let go the ones coming from other tasks.
+ * Structure used to track possible nested calls, for too deep recursions
+ * and loop cycles.
*/
-struct wake_task_node {
+struct nested_call_node {
struct list_head llink;
struct task_struct *task;
- wait_queue_head_t *wq;
+ void *cookie;
};

/*
- * This is used to implement the safe poll wake up avoiding to reenter
- * the poll callback from inside wake_up().
+ * This structure is used as a collector for nested calls, to check for
+ * maximum recursion depth and loop cycles.
*/
-struct poll_safewake {
- struct list_head wake_task_list;
+struct nested_calls {
+ struct list_head tasks_call_list;
spinlock_t lock;
};

@@ -231,6 +228,12 @@
struct epitem *epi;
};

+/* Used by the ep_send_events() function as callback private data */
+struct ep_send_events_data {
+ int maxevents;
+ struct epoll_event __user *events;
+};
+
/*
* Configuration options available inside /proc/sys/fs/epoll/
*/
@@ -244,8 +247,11 @@
*/
static DEFINE_MUTEX(epmutex);

-/* Safe wake up implementation */
-static struct poll_safewake psw;
+/* Used for safe wake up implementation */
+static struct nested_calls poll_safewake_ncalls;
+
+/* Used to call file's f_op->poll() under the nested calls boundaries */
+static struct nested_calls poll_readywalk_ncalls;

/* Slab cache used to allocate "struct epitem" */
static struct kmem_cache *epi_cache __read_mostly;
@@ -322,64 +328,96 @@
}

/* Initialize the poll safe wake up structure */
-static void ep_poll_safewake_init(struct poll_safewake *psw)
+static void ep_nested_calls_init(struct nested_calls *ncalls)
{
-
- INIT_LIST_HEAD(&psw->wake_task_list);
- spin_lock_init(&psw->lock);
+ INIT_LIST_HEAD(&ncalls->tasks_call_list);
+ spin_lock_init(&ncalls->lock);
}

-/*
- * Perform a safe wake up of the poll wait list. The problem is that
- * with the new callback'd wake up system, it is possible that the
- * poll callback is reentered from inside the call to wake_up() done
- * on the poll wait queue head. The rule is that we cannot reenter the
- * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
- * and we cannot reenter the same wait queue head at all. This will
- * enable to have a hierarchy of epoll file descriptor of no more than
- * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
- * because this one gets called by the poll callback, that in turn is called
- * from inside a wake_up(), that might be called from irq context.
+/**
+ * ep_call_nested - Perform a bound (possibly) nested call, by checking
+ * that the recursion limit is not exceeded, and that
+ * the same nested call (by the meaning of same cookie) is
+ * not re-entered.
+ *
+ * @ncalls: Pointer to the nested_calls structure to be used for this call.
+ * @max_nests: Maximum number of allowed nesting calls.
+ * @nproc: Nested call core function pointer.
+ * @priv: Opaque data to be passed to the @nproc callback.
+ * @cookie: Cookie to be used to identify this nested call.
+ *
+ * Returns: Returns the code returned by the @nproc callback, or -1 if
+ * the maximum recursion limit has been exceeded.
*/
-static void ep_poll_safewake(struct poll_safewake *psw, wait_queue_head_t *wq)
+static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
+ int (*nproc)(void *, void *, int), void *priv,
+ void *cookie)
{
- int wake_nests = 0;
+ int error, call_nests = 0;
unsigned long flags;
struct task_struct *this_task = current;
- struct list_head *lsthead = &psw->wake_task_list;
- struct wake_task_node *tncur;
- struct wake_task_node tnode;
+ struct list_head *lsthead = &ncalls->tasks_call_list;
+ struct nested_call_node *tncur;
+ struct nested_call_node tnode;

- spin_lock_irqsave(&psw->lock, flags);
+ spin_lock_irqsave(&ncalls->lock, flags);

- /* Try to see if the current task is already inside this wakeup call */
+ /*
+ * Try to see if the current task is already inside this wakeup call.
+ * We use a list here, since the population inside this set is always
+ * very much limited.
+ */
list_for_each_entry(tncur, lsthead, llink) {
-
- if (tncur->wq == wq ||
- (tncur->task == this_task && ++wake_nests > EP_MAX_POLLWAKE_NESTS)) {
+ if (tncur->task == this_task &&
+ (tncur->cookie == cookie || ++call_nests > max_nests)) {
/*
* Ops ... loop detected or maximum nest level reached.
* We abort this wake by breaking the cycle itself.
*/
- spin_unlock_irqrestore(&psw->lock, flags);
- return;
+ spin_unlock_irqrestore(&ncalls->lock, flags);
+
+ return -1;
}
}

- /* Add the current task to the list */
+ /* Add the current task and cookie to the list */
tnode.task = this_task;
- tnode.wq = wq;
+ tnode.cookie = cookie;
list_add(&tnode.llink, lsthead);

- spin_unlock_irqrestore(&psw->lock, flags);
+ spin_unlock_irqrestore(&ncalls->lock, flags);

- /* Do really wake up now */
- wake_up_nested(wq, 1 + wake_nests);
+ /* Call the nested function */
+ error = (*nproc)(priv, cookie, call_nests);

/* Remove the current task from the list */
- spin_lock_irqsave(&psw->lock, flags);
+ spin_lock_irqsave(&ncalls->lock, flags);
list_del(&tnode.llink);
- spin_unlock_irqrestore(&psw->lock, flags);
+ spin_unlock_irqrestore(&ncalls->lock, flags);
+
+ return error;
+}
+
+static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
+{
+ wake_up_nested((wait_queue_head_t *) cookie, 1 + call_nests);
+ return 0;
+}
+
+/*
+ * Perform a safe wake up of the poll wait list. The problem is that
+ * with the new callback'd wake up system, it is possible that the
+ * poll callback is reentered from inside the call to wake_up() done
+ * on the poll wait queue head. The rule is that we cannot reenter the
+ * wake up code from the same task more than EP_MAX_NESTS times,
+ * and we cannot reenter the same wait queue head at all. This will
+ * enable to have a hierarchy of epoll file descriptor of no more than
+ * EP_MAX_NESTS deep.
+ */
+static void ep_poll_safewake(wait_queue_head_t *wq)
+{
+ ep_call_nested(&poll_safewake_ncalls, EP_MAX_NESTS,
+ ep_poll_wakeup_proc, NULL, wq);
}

/*
@@ -407,6 +445,104 @@
}
}

+/**
+ * ep_scan_ready_list - Scans the ready list in a way that makes possible for
+ * the scan code, to call f_op->poll(). Also allows for
+ * O(NumReady) performance.
+ *
+ * @ep: Pointer to the epoll private data structure.
+ * @sproc: Pointer to the scan callback.
+ * @priv: Private opaque data passed to the @sproc callback.
+ *
+ * Returns: The same integer error code returned by the @sproc callback.
+ */
+static int ep_scan_ready_list(struct eventpoll *ep,
+ int (*sproc)(struct eventpoll *,
+ struct list_head *, void *),
+ void *priv)
+{
+ int error, pwake = 0;
+ unsigned long flags;
+ struct epitem *epi, *nepi;
+ struct list_head txlist;
+
+ INIT_LIST_HEAD(&txlist);
+
+ /*
+ * We need to lock this because we could be hit by
+ * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
+ */
+ mutex_lock(&ep->mtx);
+
+ /*
+ * Steal the ready list, and re-init the original one to the
+ * empty list. Also, set ep->ovflist to NULL so that events
+ * happening while looping w/out locks, are not lost. We cannot
+ * have the poll callback to queue directly on ep->rdllist,
+ * because we want the "sproc" callback to be able to do it
+ * in a lockless way.
+ */
+ spin_lock_irqsave(&ep->lock, flags);
+ list_splice(&ep->rdllist, &txlist);
+ INIT_LIST_HEAD(&ep->rdllist);
+ ep->ovflist = NULL;
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ /*
+ * Now call the callback function.
+ */
+ error = (*sproc)(ep, &txlist, priv);
+
+ spin_lock_irqsave(&ep->lock, flags);
+ /*
+ * During the time we spent inside the "sproc" callback, some
+ * other events might have been queued by the poll callback.
+ * We re-insert them inside the main ready-list here.
+ */
+ for (nepi = ep->ovflist; (epi = nepi) != NULL;
+ nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+ /*
+ * We need to check if the item is already in the list.
+ * During the "sproc" callback execution time, items are
+ * queued into ->ovflist but the "txlist" might already
+ * contain them, and the list_splice() below takes care of them.
+ */
+ if (!ep_is_linked(&epi->rdllink))
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ }
+ /*
+ * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
+ * releasing the lock, events will be queued in the normal way inside
+ * ep->rdllist.
+ */
+ ep->ovflist = EP_UNACTIVE_PTR;
+
+ /*
+ * Quickly re-inject items left on "txlist".
+ */
+ list_splice(&txlist, &ep->rdllist);
+
+ if (!list_empty(&ep->rdllist)) {
+ /*
+ * Wake up (if active) both the eventpoll wait list and the ->poll()
+ * wait list (delayed after we release the lock).
+ */
+ if (waitqueue_active(&ep->wq))
+ wake_up_locked(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ pwake++;
+ }
+ spin_unlock_irqrestore(&ep->lock, flags);
+
+ mutex_unlock(&ep->mtx);
+
+ /* We have to call this outside the lock */
+ if (pwake)
+ ep_poll_safewake(&ep->poll_wait);
+
+ return error;
+}
+
/*
* Removes a "struct epitem" from the eventpoll RB tree and deallocates
* all the associated resources. Must be called with "mtx" held.
@@ -457,7 +593,7 @@

/* We need to release all tasks waiting for these file */
if (waitqueue_active(&ep->poll_wait))
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

/*
* We need to lock this because we could be hit by
@@ -507,22 +643,49 @@
return 0;
}

+static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
+{
+ struct epitem *epi, *tmp;
+
+ list_for_each_entry_safe(epi, tmp, head, rdllink) {
+ if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+ epi->event.events)
+ return POLLIN | POLLRDNORM;
+ else
+ /*
+ * Item has been dropped into the ready list by the poll
+ * callback, but it's not actually ready, as far as
+ * caller requested events goes. We can remove it here.
+ */
+ list_del_init(&epi->rdllink);
+ }
+
+ return 0;
+}
+
+static int ep_poll_readyevents_proc(void *priv, void *cookie, int call_nests)
+{
+ return ep_scan_ready_list(priv, ep_read_events_proc, NULL);
+}
+
static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait)
{
- unsigned int pollflags = 0;
- unsigned long flags;
+ int pollflags;
struct eventpoll *ep = file->private_data;

/* Insert inside our poll wait queue */
poll_wait(file, &ep->poll_wait, wait);

- /* Check our condition */
- spin_lock_irqsave(&ep->lock, flags);
- if (!list_empty(&ep->rdllist))
- pollflags = POLLIN | POLLRDNORM;
- spin_unlock_irqrestore(&ep->lock, flags);
+ /*
+ * Proceed to find out if wanted events are really available inside
+ * the ready list. This need to be done under ep_call_nested()
+ * supervision, since the call to f_op->poll() done on listed files
+ * could re-enter here.
+ */
+ pollflags = ep_call_nested(&poll_readywalk_ncalls, EP_MAX_NESTS,
+ ep_poll_readyevents_proc, ep, ep);

- return pollflags;
+ return pollflags != -1 ? pollflags: 0;
}

/* File callbacks that implement the eventpoll file behaviour */
@@ -552,7 +715,7 @@
* We don't want to get "file->f_ep_lock" because it is not
* necessary. It is not necessary because we're in the "struct file"
* cleanup path, and this means that noone is using this file anymore.
- * So, for example, epoll_ctl() cannot hit here sicne if we reach this
+ * So, for example, epoll_ctl() cannot hit here since if we reach this
* point, the file counter already went to zero and fget() would fail.
* The only hit might come from ep_free() but by holding the mutex
* will correctly serialize the operation. We do need to acquire
@@ -683,12 +846,9 @@
}

/* If this file is already in the ready list we exit soon */
- if (ep_is_linked(&epi->rdllink))
- goto is_linked;
-
- list_add_tail(&epi->rdllink, &ep->rdllist);
+ if (!ep_is_linked(&epi->rdllink))
+ list_add_tail(&epi->rdllink, &ep->rdllist);

-is_linked:
/*
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
@@ -703,7 +863,7 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

return 1;
}
@@ -725,10 +885,9 @@
add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
- } else {
+ } else
/* We have to signal that an error occurred */
epi->nwait = -1;
- }
}

static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
@@ -830,7 +989,7 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_insert(%p, %p, %d)\n",
current, ep, tfile, fd));
@@ -904,137 +1063,74 @@

/* We have to call this outside the lock */
if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ ep_poll_safewake(&ep->poll_wait);

return 0;
}

-static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events,
- int maxevents)
+static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head, void *priv)
{
- int eventcnt, error = -EFAULT, pwake = 0;
- unsigned int revents;
- unsigned long flags;
- struct epitem *epi, *nepi;
- struct list_head txlist;
-
- INIT_LIST_HEAD(&txlist);
-
- /*
- * We need to lock this because we could be hit by
- * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
- */
- mutex_lock(&ep->mtx);
-
- /*
- * Steal the ready list, and re-init the original one to the
- * empty list. Also, set ep->ovflist to NULL so that events
- * happening while looping w/out locks, are not lost. We cannot
- * have the poll callback to queue directly on ep->rdllist,
- * because we are doing it in the loop below, in a lockless way.
- */
- spin_lock_irqsave(&ep->lock, flags);
- list_splice(&ep->rdllist, &txlist);
- INIT_LIST_HEAD(&ep->rdllist);
- ep->ovflist = NULL;
- spin_unlock_irqrestore(&ep->lock, flags);
+ struct ep_send_events_data *esed = priv;
+ int eventcnt;
+ unsigned int revents;
+ struct epitem *epi;
+ struct epoll_event __user *uevent;

- /*
- * We can loop without lock because this is a task private list.
- * We just splice'd out the ep->rdllist in ep_collect_ready_items().
- * Items cannot vanish during the loop because we are holding "mtx".
- */
- for (eventcnt = 0; !list_empty(&txlist) && eventcnt < maxevents;) {
- epi = list_first_entry(&txlist, struct epitem, rdllink);
+ /*
+ * We can loop without lock because we are passed a task private list.
+ * Items cannot vanish during the loop because ep_scan_ready_list() is
+ * holding "mtx" during this call.
+ */
+ for (eventcnt = 0, uevent = esed->events;
+ !list_empty(head) && eventcnt < esed->maxevents;) {
+ epi = list_first_entry(head, struct epitem, rdllink);

list_del_init(&epi->rdllink);

- /*
- * Get the ready file event set. We can safely use the file
- * because we are holding the "mtx" and this will guarantee
- * that both the file and the item will not vanish.
- */
- revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
- revents &= epi->event.events;
+ revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+ epi->event.events;

- /*
- * Is the event mask intersect the caller-requested one,
- * deliver the event to userspace. Again, we are holding
- * "mtx", so no operations coming from userspace can change
- * the item.
- */
- if (revents) {
- if (__put_user(revents,
- &events[eventcnt].events) ||
- __put_user(epi->event.data,
- &events[eventcnt].data))
- goto errxit;
- if (epi->event.events & EPOLLONESHOT)
- epi->event.events &= EP_PRIVATE_BITS;
- eventcnt++;
- }
- /*
- * At this point, noone can insert into ep->rdllist besides
- * us. The epoll_ctl() callers are locked out by us holding
- * "mtx" and the poll callback will queue them in ep->ovflist.
- */
- if (!(epi->event.events & EPOLLET) &&
- (revents & epi->event.events))
- list_add_tail(&epi->rdllink, &ep->rdllist);
- }
- error = 0;
-
-errxit:
+ /*
+ * If the event mask intersect the caller-requested one,
+ * deliver the event to userspace. Again, ep_scan_ready_list()
+ * is holding "mtx", so no operations coming from userspace
+ * can change the item.
+ */
+ if (revents) {
+ if (__put_user(revents, &uevent->events) ||
+ __put_user(epi->event.data, &uevent->data))
+ return eventcnt ? eventcnt: -EFAULT;
+ eventcnt++;
+ uevent++;
+ if (epi->event.events & EPOLLONESHOT)
+ epi->event.events &= EP_PRIVATE_BITS;
+ else if (!(epi->event.events & EPOLLET))
+ /*
+ * If this file has been added with Level Trigger
+ * mode, we need to insert back inside the ready
+ * list, so that the next call to epoll_wait()
+ * will check again the events availability.
+ * At this point, noone can insert into ep->rdllist
+ * besides us. The epoll_ctl() callers are locked
+ * out by ep_scan_ready_list() holding "mtx" and
+ * the poll callback will queue them in ep->ovflist.
+ */
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ }
+ }

- spin_lock_irqsave(&ep->lock, flags);
- /*
- * During the time we spent in the loop above, some other events
- * might have been queued by the poll callback. We re-insert them
- * inside the main ready-list here.
- */
- for (nepi = ep->ovflist; (epi = nepi) != NULL;
- nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
- /*
- * If the above loop quit with errors, the epoll item might still
- * be linked to "txlist", and the list_splice() done below will
- * take care of those cases.
- */
- if (!ep_is_linked(&epi->rdllink))
- list_add_tail(&epi->rdllink, &ep->rdllist);
- }
- /*
- * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
- * releasing the lock, events will be queued in the normal way inside
- * ep->rdllist.
- */
- ep->ovflist = EP_UNACTIVE_PTR;
-
- /*
- * In case of error in the event-send loop, or in case the number of
- * ready events exceeds the userspace limit, we need to splice the
- * "txlist" back inside ep->rdllist.
- */
- list_splice(&txlist, &ep->rdllist);
-
- if (!list_empty(&ep->rdllist)) {
- /*
- * Wake up (if active) both the eventpoll wait list and the ->poll()
- * wait list (delayed after we release the lock).
- */
- if (waitqueue_active(&ep->wq))
- wake_up_locked(&ep->wq);
- if (waitqueue_active(&ep->poll_wait))
- pwake++;
- }
- spin_unlock_irqrestore(&ep->lock, flags);
+ return eventcnt;
+}

- mutex_unlock(&ep->mtx);
+static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events,
+ int maxevents)
+{
+ struct ep_send_events_data esed;

- /* We have to call this outside the lock */
- if (pwake)
- ep_poll_safewake(&psw, &ep->poll_wait);
+ esed.maxevents = maxevents;
+ esed.events = events;

- return eventcnt == 0 ? error: eventcnt;
+ return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
}

static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
@@ -1046,7 +1142,7 @@
wait_queue_t wait;

/*
- * Calculate the timeout by checking for the "infinite" value ( -1 )
+ * Calculate the timeout by checking for the "infinite" value (-1)
* and the overflow condition. The passed timeout is in milliseconds,
* that why (t * HZ) / 1000.
*/
@@ -1089,9 +1185,8 @@

set_current_state(TASK_RUNNING);
}
-
/* Is it worth to try to dig for events ? */
- eavail = !list_empty(&ep->rdllist);
+ eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;

spin_unlock_irqrestore(&ep->lock, flags);

@@ -1112,42 +1207,42 @@
*/
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
- int error, fd = -1;
- struct eventpoll *ep;
+ int error;
+ struct eventpoll *ep = NULL;

/* Check the EPOLL_* constant for consistency. */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

- if (flags & ~EPOLL_CLOEXEC)
- return -EINVAL;
-
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
current, flags));

+ error = -EINVAL;
+ if (flags & ~EPOLL_CLOEXEC)
+ goto error_return;
+
/*
- * Create the internal data structure ( "struct eventpoll" ).
+ * Create the internal data structure ("struct eventpoll").
*/
error = ep_alloc(&ep);
- if (error < 0) {
- fd = error;
+ if (error < 0)
goto error_return;
- }

/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
- fd = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
- flags & O_CLOEXEC);
- if (fd < 0)
+ error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+ flags & O_CLOEXEC);
+ if (error < 0)
ep_free(ep);
- atomic_inc(&ep->user->epoll_devs);
+ else
+ atomic_inc(&ep->user->epoll_devs);

error_return:
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d) = %d\n",
- current, flags, fd));
+ current, flags, error));

- return fd;
+ return error;
}

SYSCALL_DEFINE1(epoll_create, int, size)
@@ -1371,7 +1466,10 @@
EP_ITEM_COST;

/* Initialize the structure used to perform safe poll wait head wake ups */
- ep_poll_safewake_init(&psw);
+ ep_nested_calls_init(&poll_safewake_ncalls);
+
+ /* Initialize the structure used to perform file's f_op->poll() calls */
+ ep_nested_calls_init(&poll_readywalk_ncalls);

/* Allocates slab cache used to allocate "struct epitem" items */
epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),

2009-01-27 00:24:56

by Pavel Pisa

[permalink] [raw]
Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Monday 26 January 2009 22:04:05 Davide Libenzi wrote:
> On Mon, 26 Jan 2009, Davide Libenzi wrote:
> > On Mon, 26 Jan 2009, Davide Libenzi wrote:
> > > On Mon, 26 Jan 2009, Pavel Pisa wrote:
> > > > Hello Davide,
> > > >
> > > > thanks for the fast reply and effort.
> > > >
> > > > I have tested your patch with patched and unpatched 2.6.28.2 kernels.
> > > > The outputs are attached. The patched kernel passes all your tests.
> > > >
> > > > But I have bad news. My library code does not fall into the busy loop
> > > > after your patching, but my FIFO tests do not work even in the
> > > > single-level epoll scenario. The strace output is attached as well.
> > > > I will do more testing tomorrow. I returned from a weekend trip
> > > > this evening and do not have much time for deeper analysis.
> > > >
> > > > But it looks like level-sensitive write events are not triggering
> > > > for epoll at all. The pipe is not filled by any character, and the
> > > > specified timeout expires with the following messages as failure results:
> > > > @ pipe #X evptrig wr timeout
> > > > @ pipe #X evptrig rd timeout
> > > >
> > > > If you want to test the code on your box, download the library
> > > >
> > > > http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz
> > > >
> > > > The simple "make" in the top directory should work.
> > >
> > > It'd be really great if you could spare me having to dig into a few
> > > thousand lines of userspace library code, and isolate a little
> > > better what the expected result is, and what the error result is.
> >
> > Never mind, I looked into it myself, and I'm able to replicate it with the
> > simple attached test program. Looking into it ...
>
> Alright, found it. Both my test program and yours work fine now.

Hello Davide,

thanks very much for the fast testing and patches. I have run different
combinations of your tests and mine on 2.6.28.2 patched with your latest
fix, and all have passed.

Excuse me for the long code. I was not fast enough to prepare a simpler
test, so I at least tried to log straces documenting the problem. Another
problem with reducing the test was that the behavior was timing dependent.
Your pipe ping-pong test works OK on unpatched 2.6.28.2 for some reason,
for example. My code changed behavior according to the log level and
output redirection. My updated version of the code does not trigger the
problem either; only that archived version was relatively "reliable" at
exposing the problem.

I am running the patched kernel on my laptop now, to test it in daily use.

I have some questions for you (if you find time to reply):

1) Is the original problem only exposed by epoll over epoll?
If yes, then I expect that I can use epoll over poll (glib)
even on older kernels.

2) If there are other scenarios that can leave events stuck on an
unpatched kernel, does some operation on the epoll set exist that gets
the event reporting back into sync? I would add it as a workaround in
the library.

3) Epoll with level-triggered events is the simplest poll replacement,
but EPOLLONESHOT and EPOLLET could cause less overhead on
the kernel side. Do you have some idea of the expected throughput
change?

Thanks very much again,

Pavel

2009-01-27 01:03:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug

On Tue, 27 Jan 2009, Pavel Pisa wrote:

> Hello Davide,
>
> thanks very much for the fast testing and patches. I have run different
> combinations of your tests and mine on 2.6.28.2 patched with your latest
> fix, and all have passed.
>
> Excuse me for the long code. I was not fast enough to prepare a simpler
> test, so I at least tried to log straces documenting the problem. Another
> problem with reducing the test was that the behavior was timing dependent.
> Your pipe ping-pong test works OK on unpatched 2.6.28.2 for some reason,
> for example. My code changed behavior according to the log level and
> output redirection. My updated version of the code does not trigger the
> problem either; only that archived version was relatively "reliable" at
> exposing the problem.

The epoll pipe ping-pong test only fails with the first version of the
patch I sent you; it was a bug in the patch. Older kernels, and the ones
using the new version of the patch, are fine WRT that test.



> 1) Is the original problem only exposed by epoll over epoll?
> If yes, then I expect that I can use epoll over poll (glib)
> even on older kernels.

No. Epoll's poll() was broken, which means you could not add the epoll fd
inside any container (poll and select). Well, you could, but you would get
spurious wakeups.
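
As a sketch of that usage, cascading a level-triggered epoll set under a
plain poll() loop (the glib-style setup Pavel asks about) can look like
this; dispatch_once is a hypothetical helper name, and epfd is assumed to
be an epoll fd with entries already registered:

#include <poll.h>
#include <sys/epoll.h>

/* Returns the number of epoll events handled, 0 on timeout, -1 on error */
static int dispatch_once(int epfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = epfd, .events = POLLIN };
    struct epoll_event evs[16];
    int i, n;

    /* The epoll fd polls readable when its set has ready members */
    n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return n;

    /* Fetch the actual events without blocking */
    n = epoll_wait(epfd, evs, 16, 0);
    for (i = 0; i < n; i++) {
        /* handle evs[i].events for evs[i].data.fd here */
    }
    return n;
}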



> 2) If there are other scenarios that can leave events stuck on an
> unpatched kernel, does some operation on the epoll set exist that gets
> the event reporting back into sync? I would add it as a workaround in
> the library.

Look at what the epoll_test.c program I sent you does.
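
epoll_test.c is not inlined in the archive, so the following is only a
guessed illustration of one resync technique, not necessarily what that
program does (ep_resync_fd is a hypothetical name): EPOLL_CTL_MOD makes
the kernel re-evaluate the fd's readiness, which can bring a stale entry
back into a consistent state.

#include <sys/epoll.h>

/*
 * Re-issue EPOLL_CTL_MOD with the unchanged event mask: the kernel polls
 * the fd again and re-queues it only if it is really ready.
 */
static int ep_resync_fd(int epfd, int fd, struct epoll_event *ev)
{
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, ev);
}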



> 3) Epoll with level-triggered events is the simplest poll replacement,
> but EPOLLONESHOT and EPOLLET could cause less overhead on
> the kernel side. Do you have some idea of the expected throughput
> change?

It all depends on the application. EPOLLET departs from other event fetch
mechanisms, so if you have a generic library, you may want to be careful
in its usage.
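
For illustration, a minimal sketch of the usual edge-triggered discipline,
assuming a nonblocking fd registered with EPOLLIN | EPOLLET (the handler
name is hypothetical): the handler must drain until EAGAIN, because the
next edge is only reported after a new state change.

#include <errno.h>
#include <unistd.h>

static void handle_et_readable(int fd)
{
    char buf[4096];
    ssize_t n;

    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;       /* got data: process buf[0..n-1] here */
        if (n == 0 || errno == EAGAIN || errno == EWOULDBLOCK)
            break;          /* EOF, or fully drained */
        if (errno == EINTR)
            continue;       /* interrupted: retry the read */
        break;              /* real error: report upstream */
    }
}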



- Davide