I am getting the above message in 2.5.51, 52, and 52+bk current.
Pci info follows:
oscar# lspci -vv
00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
Latency: 16
Region 0: Memory at e0000000 (32-bit, prefetchable) [size=64M]
Capabilities: [a0] AGP version 1.0
Status: RQ=7 SBA+ 64bit- FW- Rate=x1,x2
Command: RQ=0 SBA- AGP- 64bit- FW- Rate=<none>
with a mga G400
01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 04) (prog-if 00 [VGA])
Subsystem: Matrox Graphics, Inc. Millennium G400 MAX/Dual Head 32Mb
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (4000ns min, 8000ns max), cache line size 08
Interrupt: pin A routed to IRQ 11
Region 0: Memory at e8000000 (32-bit, prefetchable) [size=32M]
Region 1: Memory at e4000000 (32-bit, non-prefetchable) [size=16K]
Region 2: Memory at e5000000 (32-bit, non-prefetchable) [size=8M]
Expansion ROM at <unassigned> [disabled] [size=64K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [f0] AGP version 2.0
Status: RQ=31 SBA+ 64bit- FW- Rate=x1,x2
Command: RQ=31 SBA+ AGP+ 64bit- FW- Rate=x1
What else would help to debug this? The drm error above is all I find in the logs...
TIA
Ed Tomlinson
Replying to Ed Tomlinson:
> I am getting the above message in 2.5.51, 52, and 52+bk current.
> Pci info follows:
> What else would help to debug this? The drm error above is all I find in the logs...
If you mount devfs somewhere you also don't find misc/agpgart inside ?
:)))
And nothing about agp aperture in dmesg?
--
Paul P 'Stingray' Komkoff 'Greatest' Jr /// (icq)23200764 /// (http)stingr.net
When you're invisible, the only one really watching you is you (my keychain)
On Mon, Dec 16, 2002 at 08:49:16PM -0500, Ed Tomlinson wrote:
> I am getting the above message in 2.5.51, 52, and 52+bk current.
> Pci info follows:
>
> What else would help to debug this? The drm error above is all I find in the logs...
There are a bunch of pending fixes at bk://linux-dj.bkbits.net/agpgart,
but nothing that should be relevant to this problem.
Are you using agpgart as modules? Which ones loaded ?
I've a feeling agpgart.ko loaded here, but not via-agp.ko
What needs to happen is when agpgart.ko is loaded, all the
chipset drivers also get pulled in as dependancies.
Since the new module stuff went in, I'm not sure how this
works, if it works at all.[1]
Dave
[1] I'm yet another developer who has had a rough time with the
new modules stuff. I'll try it again soon.
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
Paul P Komkoff Jr wrote:
> Replying to Ed Tomlinson:
>> I am getting the above message in 2.5.51, 52, and 52+bk current.
>> Pci info follows:
>> What else would help to debug this? The drm error above is all I find in
>> the logs...
>
> If you mount devfs somewhere you also don't find misc/agpgart inside ?
> :)))
>
> And nothing about agp aperture in dmesg?
Not normally. If I modprobe via-agp modprobe segfaults (a Rusty's bug),
but via_agp and agpgart get loaded (note that - changed to _ when the module
is loaded - it has dash in file in the directory). Doing it this time gets
an oops (52bk as of last night):
Linux agpgart interface v0.100 (c) Dave Jones
agpgart: Detected VIA MVP3 chipset
Unable to handle kernel paging request at virtual address e0db9080
printing eip:
e0db9080
*pde = 1ed12067
*pte = 00000000
Oops: 0000
CPU: 0
EIP: 0060:[<e0db9080>] Not tainted
EFLAGS: 00010297
EIP is at 0xe0db9080
eax: 00000000 ebx: c15d5800 ecx: c0291b68 edx: 00000282
esi: 00000000 edi: c15d584c ebp: c15d5800 esp: d93e5f08
ds: 0068 es: 0068 ss: 0068
Process modprobe (pid: 19122, threadinfo=d93e4000 task=df09aca0)
Stack: e0de816f c15d5800 c15d5800 e0dc81a1 c15d5800 e0dc84a8 c019f5c8
c15d5800
e0dcd2ac c15d584c e0dc84a8 ffffffed e0dc84a8 e0dc8480 c01ac0af
c15d584c
c15d584c c15d5854 c02bd934 c01ac183 c15d584c e0dc84a8 c02bd840
e0dc83b8
Call Trace:
[<e0de816f>] agp_register_driver+0x27/0x9c [agpgart]
[<e0dc81a1>] agp_via_probe+0x35/0x3c [via_agp]
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<c019f5c8>] pci_device_probe+0x40/0x5c
[<e0dcd2ac>] agp_via_pci_table+0x0/0x38 [via_agp]
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<e0dc8480>] agp_via_pci_driver+0x0/0xa0 [via_agp]
[<c01ac0af>] bus_match+0x37/0x6c
[<c01ac183>] driver_attach+0x37/0x60
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<e0dc83b8>] __module_pci_device_size+0x10/0x18 [via_agp]
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<c01ac44c>] bus_add_driver+0xa4/0xc4
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<e0dc84c8>] agp_via_pci_driver+0x48/0xa0 [via_agp]
[<c01ac83c>] driver_register+0x34/0x38
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<c019f6c2>] pci_register_driver+0x42/0x50
[<e0dc84a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
[<e0dcd1af>] agp_via_init+0xb/0x44 [via_agp]
[<e0dc8480>] agp_via_pci_driver+0x0/0xa0 [via_agp]
[<c0125752>] sys_init_module+0x116/0x1a4
[<c0108937>] syscall_call+0x7/0xb
Code: Bad EIP value.
The module load occured after X was started.
Ed Tomlinson
On December 18, 2002 04:46 am, Rusty Russell wrote:
> In message <[email protected]> you write:
> > On Wed, Dec 18, 2002 at 06:07:48PM +1100, Rusty Russell wrote:
> > > Dave, it's true that it's my fault, but I'm afraid it looks like your
> > > bug 8). This is most likely karma for not reporting bugs in the
> > > module code when you had problems 8)
> >
> > Bah, my bad. 8)
>
> That's OK, you're in good company as the first I heard from Alan was
> that he gave up kernel work because "modules was so broken". 8(
>
> > > static int __init agp_backend_initialize(struct pci_dev *dev)
> > >
> > > ....
> > > int agp_register_driver (struct pci_dev *dev)
> > > {
> > > ....
> > > ret_val = agp_backend_initialize(dev);
> >
> > Whoops. Thanks, fixed up here.
> > I'll bet I duped that bug in other places too, as I've rejigged
> > stuff around. I'll double check those paths in a mo.
> >
> > > Ed, does it work if you take all the __init out of the agp code?
> >
> > My moneys on it working. The oops looked like it was jumping to oblivion
> > when it called agp_backend_initialize. So the new modutils discards
> > __init sections ? That's a new feature isn't it ?
>
> Yeah, kind of a bonus. It's actually arch-dependent.
>
> I've added an item in my TODO list to check for relocations on the
> non-init section which point into the init sections. It'd be cute.
> It'll probably never go into the main kernel, and it's no actually an
> error, you can imagine code which does:
>
> if (initializing)
> some_init_func();
Dave when you have this in a bk tree let me know and I will pull and
verify it working here.
Thanks to both of you
Ed Tomlinson
On Wed, Dec 18, 2002 at 07:57:53AM -0500, Ed Tomlinson wrote:
> > > > Ed, does it work if you take all the __init out of the agp code?
> > > My moneys on it working. The oops looked like it was jumping to oblivion
> > > when it called agp_backend_initialize.
> Dave when you have this in a bk tree let me know and I will pull and
> verify it working here.
bk://linux-dj.bkbits.net/agpgart
I've given it a compile testing, but not booted it yet.
Scream if necessary.
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
On Dec 17 Ed Tomlinson wrote:
>Not normally. If I modprobe via-agp modprobe segfaults (a Rusty's bug),
>but via_agp and agpgart get loaded (note that - changed to _ when the module
>is loaded - it has dash in file in the directory). Doing it this time gets
>an oops (52bk as of last night):
[snip]
I get a very similar oops, but with amd_k7_agp (2.5.52-mm2). I'm not
bk-savvy as yet, but if pointed at a diff, would be happy to verify it.
Matt
On Fri, 20 Dec 2002, Matt Bernstein wrote:
| On Dec 17 Ed Tomlinson wrote:
|
| >Not normally. If I modprobe via-agp modprobe segfaults (a Rusty's bug),
| >but via_agp and agpgart get loaded (note that - changed to _ when the module
| >is loaded - it has dash in file in the directory). Doing it this time gets
| >an oops (52bk as of last night):
| [snip]
|
| I get a very similar oops, but with amd_k7_agp (2.5.52-mm2). I'm not
| bk-savvy as yet, but if pointed at a diff, would be happy to verify it.
2.5.zz kernel diff snapshots (from bk) are available at
http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/
e.g., latest is:
http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/patch-2.5.52-bk4.bz2
--
~Randy
On Thu, Dec 19, 2002 at 05:43:14PM -0800, Randy.Dunlap wrote:
> |
> | I get a very similar oops, but with amd_k7_agp (2.5.52-mm2). I'm not
> | bk-savvy as yet, but if pointed at a diff, would be happy to verify it.
>
> 2.5.zz kernel diff snapshots (from bk) are available at
> http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/
> e.g., latest is:
> http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/patch-2.5.52-bk4.bz2
Latest AGP bits aren't in Linus tree yet. A few more bits to nail
down, and then I'll ask him to pull again.
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
On December 19, 2002 05:59 am, Dave Jones wrote:
> On Wed, Dec 18, 2002 at 06:03:23PM -0500, Ed Tomlinson wrote:
> > both with and without agp 3.0 enabled I get:
> >
> > Dec 18 17:51:10 oscar kernel: Linux agpgart interface v0.100 (c) Dave
> > Jones Dec 18 17:51:29 oscar kernel: via_agp: Unknown symbol
> > agp_generic_agp_3_0_enable
>
> I don't get this. Can you mail me your .config ?
Dave, with the pull from this morning (8am EST), it almost works modular.
I get:
Dec 20 18:20:19 oscar upsd[636]: Communication established
Dec 20 18:20:47 oscar kernel: Linux agpgart interface v0.100 (c) Dave Jones
Dec 20 18:20:47 oscar kernel: agpgart: Detected VIA MVP3 chipset
Dec 20 18:20:47 oscar kernel: agpgart: AGP aperture is 64M @ 0xe0000000
Dec 20 18:20:58 oscar kernel: [drm] Initialized mga 3.1.0 20021029 on minor 0
Dec 20 18:20:58 oscar kernel: Module agpgart cannot be unloaded due to unsafe usage in drivers/char/ag
p/backend.c:58
but find this in the X startup log.
(EE) MGA: Failed to load module "mga_hal" (module does not exist, 0)
(EE) MGA(0): [agp] Out of memory (-1014)
(EE) MGA(0): [drm] failed to remove DRM signal handler
DRIUnlock called when not locked
ideas?
Ed Tomlinson
On Fri, Dec 20, 2002 at 06:29:18PM -0500, Ed Tomlinson wrote:
> Dave, with the pull from this morning (8am EST), it almost works modular.
> I get:
>
> Dec 20 18:20:19 oscar upsd[636]: Communication established
> Dec 20 18:20:47 oscar kernel: Linux agpgart interface v0.100 (c) Dave Jones
> Dec 20 18:20:47 oscar kernel: agpgart: Detected VIA MVP3 chipset
> Dec 20 18:20:47 oscar kernel: agpgart: AGP aperture is 64M @ 0xe0000000
> Dec 20 18:20:58 oscar kernel: [drm] Initialized mga 3.1.0 20021029 on minor 0
> Dec 20 18:20:58 oscar kernel: Module agpgart cannot be unloaded due to unsafe usage in drivers/char/ag
> p/backend.c:58
This one is due to the way AGPGART does (or has done for the last 3
years) its module locking. It does a MOD_INC_USE_COUNT as soon as
someone calls the acquire routines. (So you can't unload agpgart
whilst you've a 3d using app (like X) open).
This seems quite sensible, but these days you can't unload agpgart.ko
anyway because the chipset module (via-agp.ko in your case) already
has it 'in use', so I'm tempted to drop those bits.
> but find this in the X startup log.
> (EE) MGA: Failed to load module "mga_hal" (module does not exist, 0)
That's matrox's binary only X blob. Not my fault.
> (EE) MGA(0): [agp] Out of memory (-1014)
This one is. But it may be a knock-on effect from the bug above.
I'll nail that one first.
> (EE) MGA(0): [drm] failed to remove DRM signal handler
> DRIUnlock called when not locked
That one's a problem for the DRI folks.
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
On December 21, 2002 09:22 am, Dave Jones wrote:
> On Fri, Dec 20, 2002 at 06:29:18PM -0500, Ed Tomlinson wrote:
> > but find this in the X startup log.
> > (EE) MGA: Failed to load module "mga_hal" (module does not exist, 0)
>
> That's matrox's binary only X blob. Not my fault.
Know all about this one... My G400 runs just fine without with X setups.
> > (EE) MGA(0): [agp] Out of memory (-1014)
>
> This one is. But it may be a knock-on effect from the bug above.
> I'll nail that one first.
Thanks
> > (EE) MGA(0): [drm] failed to remove DRM signal handler
> > DRIUnlock called when not locked
>
> That one's a problem for the DRI folks.
Yep
Now for something new. With bk current (6pm EST) I get:
Dec 21 23:30:56 oscar kernel: Linux agpgart interface v0.100 (c) Dave Jones
Dec 21 23:30:56 oscar kernel: agpgart: Detected VIA MVP3 chipset
Dec 21 23:30:56 oscar kernel: printing eip:
Dec 21 23:30:56 oscar kernel: e0db9000
Dec 21 23:30:56 oscar kernel: Oops: 0000
Dec 21 23:30:56 oscar kernel: CPU: 0
Dec 21 23:30:56 oscar kernel: EIP: 0060:[<e0db9000>] Not tainted
Dec 21 23:30:56 oscar kernel: EFLAGS: 00010296
Dec 21 23:30:56 oscar kernel: EIP is at 0xe0db9000
Dec 21 23:30:56 oscar kernel: eax: 00000000 ebx: c15d8800 ecx: 000000a4 edx: 07000203
Dec 21 23:30:56 oscar kernel: esi: 00000000 edi: c15d884c ebp: c9fc3ee0 esp: c9fc3ec8
Dec 21 23:30:56 oscar kernel: ds: 0068 es: 0068 ss: 0068
Dec 21 23:30:56 oscar kernel: Process modprobe (pid: 2175, threadinfo=c9fc2000 task=d60e92a0)
Dec 21 23:30:56 oscar kernel: Stack: e0de9108 c15d8800 00000000 c15d884c 00000000 00000000 c9fc3ef0 e0de92f8
Dec 21 23:30:56 oscar kernel: c15d8800 c15d8800 c9fc3f04 e0dd035a c15d8800 e0dcb4a8 a0dcb4a8 c9fc3f28
Dec 21 23:30:56 oscar kernel: c019d252 c15d8800 e0dd046c c15d884c e0dcb4a8 ffffffed c15d8800 e0dcb480
Dec 21 23:30:56 oscar kernel: Call Trace:
Dec 21 23:30:56 oscar kernel: [<e0de9108>] agp_backend_initialize+0x1c/0x168 [agpgart]
Dec 21 23:30:56 oscar kernel: [<e0de92f8>] agp_register_driver+0x2c/0xac [agpgart]
Dec 21 23:30:56 oscar kernel: [<e0dd035a>] agp_via_probe+0x62/0x6c [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [pci_device_probe+70/96] pci_device_probe+0x46/0x60
Dec 21 23:30:56 oscar kernel: [<e0dd046c>] agp_via_pci_table+0x0/0x38 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb480>] agp_via_pci_driver+0x0/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [bus_match+56/108] bus_match+0x38/0x6c
Dec 21 23:30:56 oscar kernel: [driver_attach+66/108] driver_attach+0x42/0x6c
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [bus_add_driver+172/204] bus_add_driver+0xac/0xcc
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb4c8>] agp_via_pci_driver+0x48/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [driver_register+54/56] driver_register+0x36/0x38
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [pci_register_driver+68/80] pci_register_driver+0x44/0x50
Dec 21 23:30:56 oscar kernel: [<e0dcb4a8>] agp_via_pci_driver+0x28/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dd0372>] agp_via_init+0xe/0x44 [via_agp]
Dec 21 23:30:56 oscar kernel: [<e0dcb480>] agp_via_pci_driver+0x0/0xa0 [via_agp]
Dec 21 23:30:56 oscar kernel: [sys_init_module+274/408] sys_init_module+0x112/0x198
Dec 21 23:30:56 oscar kernel: [syscall_call+7/11] syscall_call+0x7/0xb
Dec 21 23:30:56 oscar kernel:
Dec 21 23:30:56 oscar kernel: Code: Bad EIP value.
when modprobing via_agp
Ed
On Sat, Dec 21, 2002 at 11:41:29PM -0500, Ed Tomlinson wrote:
>
> Now for something new. With bk current (6pm EST) I get:
> Dec 21 23:30:56 oscar kernel: Call Trace:
> Dec 21 23:30:56 oscar kernel: [<e0de9108>] agp_backend_initialize+0x1c/0x168 [agpgart]
> Dec 21 23:30:56 oscar kernel: [<e0de92f8>] agp_register_driver+0x2c/0xac [agpgart]
I already fixed a bug with the same call-trace. This looks like
you've still got old .o files around. Can you make clean and rebuild
just to make sure ?
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
On December 22, 2002 07:28 am, Dave Jones wrote:
> On Sat, Dec 21, 2002 at 11:41:29PM -0500, Ed Tomlinson wrote:
> > Now for something new. With bk current (6pm EST) I get:
> > Dec 21 23:30:56 oscar kernel: Call Trace:
> > Dec 21 23:30:56 oscar kernel: [<e0de9108>]
> > agp_backend_initialize+0x1c/0x168 [agpgart] Dec 21 23:30:56 oscar
> > kernel: [<e0de92f8>] agp_register_driver+0x2c/0xac [agpgart]
>
> I already fixed a bug with the same call-trace. This looks like
> you've still got old .o files around. Can you make clean and rebuild
> just to make sure ?
Rebuilt with make clean - I am getting the same oops...
Ed
In message <[email protected]> you write:
> On Fri, Dec 20, 2002 at 06:29:18PM -0500, Ed Tomlinson wrote:
> > Dave, with the pull from this morning (8am EST), it almost works modular.
> > I get:
> >
> > Dec 20 18:20:19 oscar upsd[636]: Communication established
> > Dec 20 18:20:47 oscar kernel: Linux agpgart interface v0.100 (c) Dave Jone
s
> > Dec 20 18:20:47 oscar kernel: agpgart: Detected VIA MVP3 chipset
> > Dec 20 18:20:47 oscar kernel: agpgart: AGP aperture is 64M @ 0xe0000000
> > Dec 20 18:20:58 oscar kernel: [drm] Initialized mga 3.1.0 20021029 on mino
r 0
> > Dec 20 18:20:58 oscar kernel: Module agpgart cannot be unloaded due to uns
afe usage in drivers/char/ag
> > p/backend.c:58
>
> This one is due to the way AGPGART does (or has done for the last 3
> years) its module locking. It does a MOD_INC_USE_COUNT as soon as
> someone calls the acquire routines.
Which is racy under SMP, and under preempt, which is why it's
deprecated.
> (So you can't unload agpgart whilst you've a 3d using app (like X)
> open). This seems quite sensible, but these days you can't unload
> agpgart.ko anyway because the chipset module (via-agp.ko in your
> case) already has it 'in use', so I'm tempted to drop those bits.
If this is true (it usually is), you can simply drop them. There are
other cases where the caller is not grabbing references, so
MOD_INC_USE_COUNT is better than nothing (should the warning stay for
2.6? Good question).
Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Mon, Dec 23, 2002 at 12:10:47PM +1100, Rusty Russell wrote:
> > This one is due to the way AGPGART does (or has done for the last 3
> > years) its module locking. It does a MOD_INC_USE_COUNT as soon as
> > someone calls the acquire routines.
> Which is racy under SMP, and under preempt, which is why it's
> deprecated.
Crapola. I've just realised why this is no longer relevant.
I've moved what this was protecting into the per chipset modules.
Right now its possible to load modules, start x (which loads DRM),
then rmmod via_agp from under its feet. result - boom when something
tries to use 3d.
Sure it's unlikely someone would be crazy enough to try and do this,
and they deserve what they get, but it's not exactly clean, or nice.
So where is the documentation describing module locking de jour ?
> > (So you can't unload agpgart whilst you've a 3d using app (like X)
> > open). This seems quite sensible, but these days you can't unload
> > agpgart.ko anyway because the chipset module (via-agp.ko in your
> > case) already has it 'in use', so I'm tempted to drop those bits.
> If this is true (it usually is), you can simply drop them.
I'll need 'something' in the chipset drivers. The first thing that
jumps to mind is to give the chipset drivers an 'acquire' op
which does the locking much like the old agp_backend_acquire() does.
> There are other cases where the caller is not grabbing references, so
> MOD_INC_USE_COUNT is better than nothing (should the warning stay for
> 2.6? Good question).
Why exactly isn't it safe any more? If there's documentation on this,
I'd love to read it. If there isn't, there really should be.
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
In message <[email protected]> you write:
> On Mon, Dec 23, 2002 at 12:10:47PM +1100, Rusty Russell wrote:
> > > This one is due to the way AGPGART does (or has done for the last 3
> > > years) its module locking. It does a MOD_INC_USE_COUNT as soon as
> > > someone calls the acquire routines.
> > Which is racy under SMP, and under preempt, which is why it's
> > deprecated.
>
> Crapola. I've just realised why this is no longer relevant.
> I've moved what this was protecting into the per chipset modules.
> Right now its possible to load modules, start x (which loads DRM),
> then rmmod via_agp from under its feet. result - boom when something
> tries to use 3d.
>
> Sure it's unlikely someone would be crazy enough to try and do this,
> and they deserve what they get, but it's not exactly clean, or nice.
>
> So where is the documentation describing module locking de jour ?
Here's the FAQ again. Note that the init stuff is not currently true
(you can enter a module doing init, with possibly bad results),
because Linus patched my code. It's on my TODO list to revisit this
issue.
> > > (So you can't unload agpgart whilst you've a 3d using app (like X)
> > > open). This seems quite sensible, but these days you can't unload
> > > agpgart.ko anyway because the chipset module (via-agp.ko in your
> > > case) already has it 'in use', so I'm tempted to drop those bits.
> > If this is true (it usually is), you can simply drop them.
>
> I'll need 'something' in the chipset drivers. The first thing that
> jumps to mind is to give the chipset drivers an 'acquire' op
> which does the locking much like the old agp_backend_acquire() does.
The problem is, a module basically can't lock itself. Hence you
should expose an "owner" field when you register an interface, and
have the caller do the locking. There are other possibly solutions,
but none of them were as simple as "lock down before you call in".
> > There are other cases where the caller is not grabbing references, so
> > MOD_INC_USE_COUNT is better than nothing (should the warning stay for
> > 2.6? Good question).
>
> Why exactly isn't it safe any more? If there's documentation on this,
> I'd love to read it. If there isn't, there really should be.
Consider the fictitious my_module.c:
static void my_function_called_through_ptr(void)
{
MOD_INC_USE_COUNT;
...
MOD_DEC_USE_COUNT;
return;
}
If you get preempted before the MOD_INC_USE_COUNT, an unload could
occur, boom. If you get preempted after the MOD_DEC_USE_COUNT, but
before the return, and an unload occurs, boom.
If it were not called through a pointer, (ie. called by name) then it
means the module calling it has a reference (this is done
automatically in symbol resolution, or symbol_get()), so no problem.
[ As an implementation detail, this is *not* a problem on SMP without
preemption. But that's an implementation detail, and it wasn't true
with the previous module implementation. ]
Hope this helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Golden Rule: If you are calling though a function pointer into a
(different) module, you must hold a reference to that module.
Otherwise you risk sleeping in the module while it is unloaded.
Q: How do I get a reference to a module?
A: Usually, a successful call to try_module_get(owner). You don't
need to check for owner != NULL, BTW.
Q: When does try_module_get(owner) fail?
A: When the module is not ready to be entered (ie. still in
init_module) or it is being removed. This prevents you
entering the module as it is being discarded (init might fail, or
it's being removed).
Q: But the modules' init routine calls my register() routine which
wants to call back into one of the function pointers immediately,
and so try_module_get() fails! (because the module is not finished
initializing yet)
A: You're being called from the module, so someone already has a
reference (unless there's a bug), so you don't need a
try_module_get().
This does mean that if you were to register a structure for
*another* module (does anyone do this?) you'd need to have a
reference to it.
Q: How do I put the reference back?
A: Using module_put(owner) (owner == NULL is OK).
Q: Do I really need to put try_module_get() before every function ptr
call?
A: If the function does not sleep (any cannot be preempted) ie. is
called in softirq or hardirq context, you can omit this step, since
you obviously won't sleep inside the module.
Also, most structs have clear "start" and "stop" functions
(eg. mount/umount), so you only need one try_module_get()
on start, and module_put() on stop.
Q: Is it safe to call try_module_get() and module_put() from an
interrtupt / softirq?
A: Yes.
Q: My code use "MOD_INC_USE_COUNT". Do I still need to adjust my
module count when someone calls one of my functions?
A: No, you never need to adjust your own module count. There are five
ways a function in your module can get called: firstly, it could be
your module_init() function, in which case the module code holds a
reference. It could be another module using one of your
EXPORT_SYMBOL'ed functions, in which case you cannot be removed
since they would have to be removed first. It could be a module
which found an EXPORT_SYMBOL'ed function using symbol_get(), in
which case they hold a reference count. It could be through a
function pointer which your module gave out previously, which is
discussed above. Finally, it could be from within your own module,
in which case someone must already hold a reference.
Q: My code uses "__MOD_INC_USE_COUNT(reg->owner)", but now I get a
warning at runtime that it is unsafe. What do I need to do?
A: You need to use try_module_get(), and not call into the module if
it fails (act as if it hasn't registered yet). Note that you no
longer need to check for NULL yourself, try_module_get() does that.
Q: My code used "GET_USE_COUNT(module)" to get the reference count.
A: Don't do that. If module unloading is disabled, there is no
reference count, and there is never a single value you can assign
to.
Q: My code used "try_inc_mod_count(module)" to get the reference
count. Should I change it?
A: No hurry. try_module_get() is exactly the same: the new name
reflects that this is now the only way to get a reference.
Q: How does the code in try_module_get() work?
A: It disables preemption for a moment, checks the live flag, and then
increments a per-cpu counter if the module is live. This is even
lighter-weight (in icache and cycles) than using a brlock, but has
the same effect. If CONFIG_MODULE_UNLOAD=n, it just becomes a
check that the module is live.
Q: How does the module remove code work?
A: It stops the machine by scheduling threads for every other CPU,
then they all disable interrupts. At this stage we know that noone
is in try_module_get(), so we can reliably read the counter. If
zero, or the rmmod user specified --wait, we set the live flag to
false. After this, the reference count should not increase, and
each module_put() will wake us up, so we can check the counter
again.
Q: Are these changes all so you could implement an in-kernel module
linker?
A: No, they were to prevent load and unload races without altering
every module, nor introducing drastic new requirements.
Q: Doesn't putting linking code into the kernel just add bloat?
A: The total linking code is about 200 generic lines, 100
x86-specific lines. The ia64 linking code is about 500 lines (it's
the most complex). Richard Henderson has a great suggestion for
simplifying it furthur, which I'm implementing now. "insmod" is
now a single portable system call, meaning insmod can be written in
about 20 lines of code.
The previous code had to implement the two module loading
system calls, the module querying system call, and the /proc/ksyms
output, required a little more code than the current x86 linker.
Q: Why not just fix the old code?
A: Because having something so intimate with the kernel in userspace
greatly restricts what changes the kernel can make. Moving into
the kernel means I have implemented modversions, typesafe
extensible module parameters and kallsyms without altering
userspace in any way. Future extensions won't have to worry about
the version of modutils problem.
Q: Why not implement two-stage insert / two-stage delete?
A: Because I implemented it first and it sucked. And because this
*is* two-stage insert and two-stage delete, without exposing it to
the modules using it, with the added advantage that the second
stage is atomic (activation/deactivation is simply changing
mod->live, modulo locking implementation magic detailed above).
This prevents the race between deactivating the module and finding
that someone has starting using it as you are deactivating it.
Hi,
Rusty Russell wrote:
> > So where is the documentation describing module locking de jour ?
>
> Here's the FAQ again. Note that the init stuff is not currently true
> (you can enter a module doing init, with possibly bad results),
> because Linus patched my code. It's on my TODO list to revisit this
> issue.
Rusty, you should do this rather soon. Could you please explain, how you
intend to solve the module init races? The "don't enter module until
init is finished" concept doesn't work, so what else do you want to do?
> Q: How does the code in try_module_get() work?
> A: It disables preemption for a moment, checks the live flag, and then
> increments a per-cpu counter if the module is live. This is even
> lighter-weight (in icache and cycles) than using a brlock, but has
> the same effect.
Q: But if I have a large number of modules (e.g. netfilter) and I have
to use try_module_get() for each of them, won't it still affect
performance?
> Q: Are these changes all so you could implement an in-kernel module
> linker?
> A: No, they were to prevent load and unload races without altering
> every module, nor introducing drastic new requirements.
Q: Wasn't it possible to do the rewrite in several steps? What made it
so urgent that everything had to be 2.6?
> Q: Doesn't putting linking code into the kernel just add bloat?
> A: The total linking code is about 200 generic lines, 100
> x86-specific lines. [...]
>
> The previous code had to implement the two module loading
> system calls, the module querying system call, and the /proc/ksyms
> output, required a little more code than the current x86 linker.
2.4.20: wc kernel/module.c include/linux/module.h
1283 3800 29616 kernel/module.c
415 1609 13173 include/linux/module.h
1698 5409 42789 total
2.5.52: wc kernel/module.c kernel/params.c arch/i386/kernel/module.c
include/linux/module*
1414 4400 35284 kernel/module.c
338 1093 8182 kernel/params.c
124 412 3347 arch/i386/kernel/module.c
368 1174 9954 include/linux/module.h
61 244 1958 include/linux/moduleloader.h
127 619 5199 include/linux/moduleparam.h
2432 7942 63924 total
Q: Doesn't the querying system call provide the same information as
/proc/modules and /proc/ksyms? Is procfs now required to load modules?
Q: How can I debug modules with ksymoops?
Q: Why was it necessary to remove the system calls? Wasn't it possible
to emulate them?
> Q: Why not just fix the old code?
> A: Because having something so intimate with the kernel in userspace
> greatly restricts what changes the kernel can make. Moving into
> the kernel means I have implemented modversions, typesafe
> extensible module parameters and kallsyms without altering
> userspace in any way. Future extensions won't have to worry about
> the version of modutils problem.
Q: Can I see the modversions implementation?
Q: Was there an analysis, which discussed the limitation of the old
code? Was it really impossible to design a more flexible interface,
which left as much as possible in userspace?
Rusty, could you _please_ start answering questions instead of ignoring
them or is there no discussion required anymore, now that your code is
in the kernel?
bye, Roman