2004-04-20 09:08:25

by Zoltan Menyhart

Subject: Dynamic System Calls & System Call Hijacking



NAME

dyn_syscall_reg, hijack_syscall - Register a system call

SYNOPSIS

#include <asm/dyn_syscall.h>

int
dyn_syscall_reg(const char *name,
const unsigned int syscall_no,
const dyn_syscall_t fp);
int
hijack_syscall(const char *name,
const unsigned int syscall_no,
const dyn_syscall_t fp);

DESCRIPTION

"dyn_syscall_reg()" and "hijack_syscall()" are exported services
available for loadable kernel modules.

"dyn_syscall_reg()" registers a new, dynamic system call.
If "syscall_no" is zero, then an otherwise unused system call number
will be assigned.

"hijack_syscall()" registers a system call which overloads an
existing one.

"name" points to a string that shall persist while the system call is
alive.

"syscall_no" should be in the range of
[__NR_ni_syscall + 1... __NR_ni_syscall + NR_syscalls).

"fp" refers to the new system call.
For the IA64 architecture, the function descriptor "dyn_syscall_t"
refers to a structure containing the program counter and the global
pointer.

User applications can find this system call number in
"/proc/sys/kernel/dynamic_syscalls/<name>" or in
"/proc/sys/kernel/hijacked_syscalls/<name>", respectively.
On read, each of these files contains a 4-digit decimal number
terminated with a '\n' character.
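As an illustration of the lookup described above, here is a minimal user-space sketch in C. The helper names "parse_syscall_no()" and "lookup_syscall_no()" are hypothetical and not part of the patch; the only thing taken from the description is the file format (decimal digits followed by '\n') and the two directory names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Parse the text read from the /proc file: decimal digits ending in '\n'.
 * Returns the system call number, or -1 if the text is malformed.
 * (Hypothetical helper, not part of the patch.) */
static int parse_syscall_no(const char *text)
{
    int value = 0;
    int digits = 0;

    assert(text != NULL);
    while (isdigit((unsigned char)*text)) {
        if (digits == 9)          /* refuse anything that could overflow int */
            return -1;
        value = value * 10 + (*text - '0');
        text++;
        digits++;
    }
    /* Require at least one digit and the terminating newline. */
    if (digits == 0 || *text != '\n')
        return -1;
    return value;
}

/* Read "<dir>/<name>" and return the number it contains, or -1 on failure.
 * "dir" is expected to be one of the two /proc directories named above. */
static int lookup_syscall_no(const char *dir, const char *name)
{
    char path[256];
    char text[16];
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(text, sizeof(text), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_syscall_no(text);
}
```

An application would then pass the result of, say, lookup_syscall_no("/proc/sys/kernel/dynamic_syscalls", "my_call") to syscall(2), where "my_call" is a made-up name standing for whatever name the module registered.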

RETURN VALUE

On success, the system call number accepted / assigned is returned.

On error, the following codes may be returned:

-ENOENT: No free system call number is available -
         "dyn_syscall_reg()" only
-EINVAL: Illegal system call number - both
-EBUSY:  System call is already in use - "dyn_syscall_reg()" only
-ENOMEM: Cannot create "/proc/..." - both

SEE ALSO

syscall_unlock, prep_restore_syscall, syscall_trylock,
dyn_syscall_unreg, restore_syscall


--------------------------------------------------------------------------------


NAME

syscall_unlock, syscall_trylock - Unlock / try to lock a system call
prep_restore_syscall - Prepare to unregister a system call

SYNOPSIS

#include <asm/dyn_syscall.h>

int
syscall_unlock(const char *name,
const unsigned int syscall_no);
int
syscall_trylock(const char *name,
const unsigned int syscall_no);

int
prep_restore_syscall(const char *name,
const unsigned int syscall_no);

DESCRIPTION

"syscall_unlock()", "syscall_trylock()" and "prep_restore_syscall()"
are exported services available for loadable kernel modules.

Each system call is protected by a semaphore.

When a new system call is added, it is locked for write.
Regular system call invocation tries to take the semaphore for read.
Unless it is "syscall_unlock()"-ed, any attempt to use the system call
will be refused and "-ENOSYS" will be reported.

Before undoing a system call registration, it is necessary to lock out
any further invocation of the system call by re-locking it for write.
(They will be refused by returning "-ENOSYS".)
Apart from some small administration task, "prep_restore_syscall()"
attempts to do it. If it fails (indicated by "-EAGAIN" returned), then
there is at least one "living call" which may be "part way" through
the system call code.

"syscall_trylock()" should be invoked repeatedly while it returns
"-EAGAIN". In order not to over penalise other tasks, "schedule()"
should be invoked at each iteration. If the system call is blocking,
i.e. there can be tasks sleeping inside the system call, then they have
to be woken up. In such a case, it is recommended to sleep a bit
between two iterations of "syscall_trylock()".
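The retry loop described above can be modelled in user space as follows. This is only a sketch: "fake_syscall_trylock()" is a made-up stand-in for the real kernel service, "calls_in_flight" is an invented counter, and sched_yield() plays the role that "schedule()" would play inside a module.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical state: three tasks are still inside the system call. */
static int calls_in_flight = 3;

/* Stand-in for syscall_trylock(): fails with -EAGAIN while callers remain.
 * Each attempt pretends that one more caller has left the system call. */
static int fake_syscall_trylock(const char *name, unsigned int syscall_no)
{
    (void)name;
    (void)syscall_no;
    if (calls_in_flight > 0) {
        calls_in_flight--;
        return -EAGAIN;
    }
    return 0;
}

/* Keep trying while the lock attempt reports -EAGAIN, yielding the CPU
 * between attempts. Any other return value (0 or a hard error) ends the
 * loop. The number of retries is reported through *retries. */
static int lock_out_callers(const char *name, unsigned int syscall_no,
                            unsigned int *retries)
{
    int rc;

    *retries = 0;
    while ((rc = fake_syscall_trylock(name, syscall_no)) == -EAGAIN) {
        (*retries)++;
        sched_yield();   /* user-space analogue of schedule() */
    }
    return rc;
}
```

For a blocking system call, the loop body would additionally wake the sleepers and sleep briefly instead of merely yielding.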

"name" should be the same as that was used during the registration.

"syscall_no" should be in the range of
[__NR_ni_syscall + 1... __NR_ni_syscall + NR_syscalls).

RETURN VALUE

On success, zero is returned.

"syscall_trylock()" and "prep_restore_syscall()" return "-EAGAIN" if
they have failed to take the semaphore for write.

On error, the following codes can be returned:

-EBADF: Name or system call number does not match the parameters
that were used during the system call registration
-EINVAL: Illegal system call number

SEE ALSO

dyn_syscall_reg, hijack_syscall, dyn_syscall_unreg, restore_syscall


--------------------------------------------------------------------------------


NAME

dyn_syscall_unreg, restore_syscall - Unregister a system call

SYNOPSIS

#include <asm/dyn_syscall.h>

int
dyn_syscall_unreg(const char *name,
const unsigned int syscall_no);
int
restore_syscall(const char *name,
const unsigned int syscall_no);

DESCRIPTION

"dyn_syscall_unreg()" and "restore_syscall()" are exported services
available for loadable kernel modules.

"dyn_syscall_unreg()" unregisters a dynamic system call.

"restore_syscall()" restores a hijacked system call.

"name" should be the same as that was used during the registration.

"syscall_no" should be in the range of
[__NR_ni_syscall + 1... __NR_ni_syscall + NR_syscalls).

RETURN VALUE

On success, zero is returned.

On error, the following codes can be returned:

-EBADF: Name or system call number does not match the parameters
that were used during the system call registration
-EINVAL: Illegal system call number

SEE ALSO

dyn_syscall_reg, hijack_syscall,
syscall_unlock, syscall_trylock, prep_restore_syscall



Attachments:
dyn_syscall_man.txt (5.16 kB)

2004-04-22 19:52:06

by Pavel Machek

Subject: Re: Dynamic System Calls & System Call Hijacking

Hi!

> - Can't recompile the kernel, otherwise you gonna lose RedHat guarantee ?
> Or some ISVs like whose name starts with an "O" and terminates with "racle"
> ain't gonna support it ?

> + No problem, I'll load your syscall in a module.

Well, by forcing syscall in, you lose your guarantee, too.
cat /dev/urandom > /dev/kmem

"RedHat, help, my machine crashed."


> Your remarks will be appreciated.

I hope it at least taints the kernel.

And you did test on smp kernel, trying to race syscall calling against
your module load/unload, right?

--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-04-23 11:47:01

by Zoltan Menyhart

Subject: Re: Dynamic System Calls & System Call Hijacking

Pavel Machek wrote:
>
> Well, by forcing syscall in, you lose your guarantee, too.

Strictly speaking, you are right.

Let me give you an example of how we are going to use this dynamic syscall feature:

Assume a client of ours has a big application running on an "official" kernel.
We load our performance enhancement tool with my dynamic syscall stuff.
If the client observes better performance, then s/he loads this tool at each
re-boot.
Should the kernel crash, s/he reboots without loading it and checks whether
the problem still happens...

Another example is using it as a development tool.
Our performance enhancement tool includes a syscall. It is 100 times quicker to
load it as a module for testing than to recompile the kernel and reboot.

> I hope it at least taints the kernel.

As this dynamic syscall feature is intended to be transparent, it does not.
If someone wants to taint the kernel, it's just one line more of the code.

I checked: RedHat's AS 3.0 does not taint the kernel for 3rd party modules,
(it only does for the sii6512 software raid module).

Note that my patch is against 2.6.4. If you need to play with a 2.4.*, then
at least "kallsyms" should be changed to "ksyms".

> And you did test on smp kernel, trying to race syscall calling against
> your module load/unload, right?

"dyn_syscall.ko" can be unloaded but it is unsafe.
Here is the window:

- A CPU picks up the address of my syscall link code from "sys_call_table"
then it is pre-empted for a while
- Another CPU patches back the old address of "sys_ni_syscall" into
"sys_call_table" and unloads "dyn_syscall.ko"
- The first CPU is back to jump at my link code in "dyn_syscall.ko"

On a client's machine, it is loaded once (e.g. at boot time).
You can try to unload it (as I did) during development without risking
much, but it is recommended to keep it loaded on clients' machines.

On the other hand, unloading modules which have correctly unregistered their
system calls is 100% safe.

I did test it on a machine with 16 CPUs, but testing cannot prove that there is
no window. I'm going to summarize how the synchronization mechanism works.
There are two cases to consider:
- race among multiple syscall register / unregister operations
- race between unloading a syscall and its clients

Let's start with the first one.
My dynamic syscall feature includes a shadow system call table.
A table entry consists of:

- Name of the system call
- The saved syscall address from "sys_call_table" (atomic variable)
- A semaphore (initialized as if it were taken for write)
- Function descriptor of the new system call
- etc.

The synchronization mechanism is based on the atomic variable in each
entry of the shadow syscall table, that saves the old syscall address from
"sys_call_table":
- 0 means not in use
- 1 means reserved (going to be used)
- original "sys_call_table" entry | 1 means preparing to undo
- Otherwise saves the original "sys_call_table" entry (not an odd value)
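The four-way encoding above can be sketched with C11 atomics in user space. This is an illustrative model only, not code from the patch: the enum, "classify()", "reserve_entry()" and "begin_undo()" are hypothetical names, and the encoding relies on the assumption (stated in the list) that a real "sys_call_table" entry is never an odd value.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the four states of the saved-address word. */
enum entry_state { ENTRY_FREE, ENTRY_RESERVED, ENTRY_UNDOING, ENTRY_ACTIVE };

/* Decode the state word exactly as listed above. */
static enum entry_state classify(uintptr_t saved)
{
    if (saved == 0)
        return ENTRY_FREE;       /* not in use */
    if (saved == 1)
        return ENTRY_RESERVED;   /* going to be used */
    if (saved & 1)
        return ENTRY_UNDOING;    /* original entry | 1 */
    return ENTRY_ACTIVE;         /* original entry, an even value */
}

/* Claim a free entry with compare & swap: 0 -> 1.
 * Returns false if some other registration got there first. */
static bool reserve_entry(_Atomic uintptr_t *saved)
{
    uintptr_t expected = 0;

    return atomic_compare_exchange_strong(saved, &expected, (uintptr_t)1);
}

/* Mark an active entry as "preparing to undo": orig -> orig | 1.
 * Returns false if the entry is not active or changed underneath us. */
static bool begin_undo(_Atomic uintptr_t *saved)
{
    uintptr_t cur = atomic_load(saved);

    if (classify(cur) != ENTRY_ACTIVE)
        return false;
    return atomic_compare_exchange_strong(saved, &cur, cur | 1);
}
```

Because every transition is a single compare & swap on one word, two concurrent registrations of the same entry cannot both succeed.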

For dynamic system call assignment:
- Atomically check & decrement number of the free syscall entries.
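One way to express "atomically check & decrement" is a compare & swap loop, sketched here with C11 atomics. The function name "take_free_slot()" and the counter are hypothetical; the patch may well use a different kernel primitive.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take one free syscall slot if any is left.
 * The check (cur > 0) and the decrement happen as one atomic step,
 * so the counter can never drop below zero. */
static bool take_free_slot(atomic_int *free_slots)
{
    int cur = atomic_load(free_slots);

    while (cur > 0) {
        /* On failure, cur is reloaded with the current value. */
        if (atomic_compare_exchange_weak(free_slots, &cur, cur - 1))
            return true;
    }
    return false;   /* nothing left to hand out */
}
```

A false result here would correspond to the "-ENOENT" case documented for "dyn_syscall_reg()".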

Dynamically assigned and hijacked system call entries form two distinct sets.
A dynamically assigned syscall cannot be hijacked. No nested hijacking.
(Therefore hijacking does not care for the number of the free syscall entries.)

For both the dynamically assigned and hijacked system calls:

- Reserve the corresponding shadow syscall table entry by use of a
compare & swap atomic operation (see above)
- Do the other initialization and save the syscall address from "sys_call_table"
- Patch the address of my linkage code into the corresponding entry in
"sys_call_table"
- Unlock the semaphore

- Undo operations work in the reverse order
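The ordering of these steps can be shown on a toy model in user space. Everything here is invented for illustration: the four-entry "toy_table", the fake addresses, the boolean standing in for the semaphore, and the use of "-EBADF" for an entry that is not registered. Only the order of the steps follows the list above.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR 4

struct shadow_entry {
    _Atomic uintptr_t saved;   /* state word / saved original address */
    bool locked_for_write;     /* stands in for the semaphore */
};

/* Toy "sys_call_table": every slot initially points at a fake,
 * even-valued address playing the role of sys_ni_syscall. */
static uintptr_t toy_table[TOY_NR] = { 0x1000, 0x1000, 0x1000, 0x1000 };
static struct shadow_entry shadow[TOY_NR];

static int toy_register(unsigned int no, uintptr_t link_addr)
{
    uintptr_t expected = 0;

    if (no >= TOY_NR)
        return -EINVAL;
    /* 1. Reserve the shadow entry: 0 -> 1. */
    if (!atomic_compare_exchange_strong(&shadow[no].saved, &expected,
                                        (uintptr_t)1))
        return -EBUSY;
    shadow[no].locked_for_write = true;
    /* 2. Save the original table entry. */
    atomic_store(&shadow[no].saved, toy_table[no]);
    /* 3. Patch the linkage code address into the table. */
    toy_table[no] = link_addr;
    /* 4. Unlock (in the man pages this is a separate syscall_unlock()). */
    shadow[no].locked_for_write = false;
    return (int)no;
}

static int toy_unregister(unsigned int no)
{
    uintptr_t orig;

    if (no >= TOY_NR)
        return -EINVAL;
    orig = atomic_load(&shadow[no].saved);
    if (orig == 0 || (orig & 1))
        return -EBADF;                        /* assumption: not registered */
    shadow[no].locked_for_write = true;       /* undo step 4 */
    atomic_store(&shadow[no].saved, orig | 1);/* mark "preparing to undo" */
    toy_table[no] = orig;                     /* undo step 3 */
    atomic_store(&shadow[no].saved, 0);       /* undo steps 2 and 1 */
    return 0;
}
```

The model is single-threaded apart from the reservation step; it does not attempt to reproduce the waiting for calls that are still in flight.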

Race between unloading a syscall and its clients:

- When a new system call is added, it is locked for write.
- Regular system call invocation tries to take the semaphore for read.
- Unless the semaphore is unlocked, any attempt to use the system call
will be refused and "-ENOSYS" will be reported.

- Before undoing a system call registration, it is necessary to lock out
any further invocation of the system call by re-locking it for write.
If it fails, then there is at least one "living call" which may be "part way"
through the system call code.
"syscall_trylock()" should be invoked repeatedly while it returns "-EAGAIN".

I hope I have not missed anything :-)

Thanks,

Zoltán