I wonder if it would be useful to have something like this in tree.
It states trivial things for anyone who looked at disassembly few times
but still...
Signed-off-by: Alexey Dobriyan <[email protected]>
---
Documentation/process/code-generation.rst | 196 ++++++++++++++++++++++++++++++
1 file changed, 196 insertions(+)
new file mode 100644
--- /dev/null
+++ b/Documentation/process/code-generation.rst
@@ -0,0 +1,196 @@
+Code generation
+===============
+
+1) Generic techniques
+---------------------
+
+### a) Inlining/uninlining function calls ###
+
+External function call is serious business from code generation point of view.
+ABIs require that specific arguments are placed into specific registers before
+doing the call forcing spilling and register shuffling to accomodate ABI rules.
+Clobbered registers which aren't used by a function are wasted. Declaring
+function as ``static inline`` in a header gives compiler more information
+to work with.
+
+However, excessing inlining often leads to code bloat for no measurable
+performance gains. In such case it is probably better to save on generated code
+for icache, disk I/O and network bandwidth costs.
+
+Use ``noinline`` attribute to prevent inlining inside translation unit and
+see what happens:
+
+.. code-block:: c
+
+ noinline
+ int f()
+ {
+ ...
+ }
+
+It is hard to advice any more than that as modern compilers generate code in
+mysterious ways.
+
+
+### b) Appending arguments ###
+
+Some functions are thin wrappers appending an argument or two to another
+function which actually does the job:
+
+.. code-block:: c
+
+ int g(int, int, flag_t);
+ int f(int a, int b)
+ {
+ return g(a, b, FLAG_C);
+ }
+
+Appending an argument at the end adds minimum amount of code:
+
+.. code-block:: none
+
+ f:
+ mov edx, FLAG_C
+ jmp g
+
+Appending an argument in the middle or in the beginning will generate
+reshuffle sequence:
+
+.. code-block:: none
+
+ f:
+ mov edx, esi
+ mov esi, edi
+ mov edi, FLAG_C
+ jmp g
+
+Do not enforce this rule religiously as there may be other reasons for
+specific argument order most notably keeping related arguments together
+at source level.
+
+
+2) Architecture specific issues (i386/x86_64)
+---------------------------------------------
+
+### a) Member placement ###
+
+First member of any structure is very special on i386/x86_64: compiler will
+use ``[r32]`` or ``[r64]`` addressing mode which has the shortest encoding.
+After laying out members of a structure into cachelines for performance,
+move most often used member of the first cacheline to the very beginning.
+
+Done that, pay attention to bytes 1--127. Members placed there will be encoded
+with ``[r64+disp8]`` encoding (or ``[r32+disp8]`` on i386). This is only 1 byte
+longer than encoding used for the first member but 3 bytes _shorter_ than
+``[r64+disp32]`` used for all other members. Try to shift more often used
+members into first 2 cachelines.
+
+"Refugee" members living in byte 128 and beyond can be placed in any order.
+
+
+### b) Implicit 32/64-bit casts
+
+Avoid casts which change signedness and/or bitness of a value.
+
+If some piece of data appears in the code it generally should be kept in its
+original type unless there are specific reasons to do otherwise (packing, etc).
+With C's seemingly arcane implicit and explicit casting rules this is good advice
+from programming language point of view as well.
+
+Given the code:
+
+.. code-block:: c
+
+ void f(size_t);
+
+ int len = strlen(s);
+ f(len);
+
+if compiler doesn't or can't maintain value ranges through casts it will have
+no choice but to assume that all "size_t" values are possible and emit MOVSX
+instruction:
+
+.. code-block:: none
+
+ mov rdi, ...
+ call strlen
+ movsx rdi, eax
+ call f
+
+MOVSX by itself it not a problem but it a) may be 1 byte longer than MOV
+instruction with same arguments and b) it won't be handled by register renaming,
+increasing dependency chains by 1 instruction.
+
+
+### c) 64-bitness ###
+
+64-bit instruction are 1-byte longer than corresponding 32-bit equivalents
+on x86_64.
+
+There is one big 64-bit enabler which is dynamic memory allocation: all
+kmalloc variant accept ``size_t`` and ``sizeof`` operator returns ``size_t``.
+
+Do not use 64-bit/``size_t`` unless strictly necessary (pointer-to-integer
+conversion, syscall ABI interfaces, integers which can be genuinely big on
+big machines, statistics).
+
+Use 32-bit ``unsigned int``. Kernel simply doesn't to individual 4+ GB
+allocations and if it does it probably goes via page allocator. Such huge
+amounts of memory simply aren't needed: network doesn't do gigabyte packets,
+VFS caps IO at 2 GB minus a little and interating with userspace via
+``copy_from_user``/``copy_to_user`` is capped at ``INT_MAX`` as well.
+
+.. code-block:: c
+
+ #define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
+
+The only exceptional case is ``size_t`` value being passed directly into
+a standard function accepting ``size_t`` (``memset``, ``memcpy``, ...).
+Truncating value to 32-bit won't do anything useful in this case.
+
+
+### d) 16-bitness ###
+
+16-bit instructions will generate 1-byte operand size override prefix (66)
+which again bloats an instruction by 1 byte. Unlike REX prefixes, this is
+unavoidable.
+
+It is better to use 16-bit types at ABI/protocol/memory level, convert
+to plain ``int``/``unsigned int`` as soon as possible and work with that.
+
+Preferred order of bitness on x86_64 is:
+
+ 32/8-bit > 64-bit > 16-bit.
+
+3) Architecture specific issues (arm/arm64)
+-------------------------------------------
+
+### Constant flags value selection ###
+
+"Tight" constants can be loaded into a register in 1 instruction on arm and
+other RISC architectures:
+
+.. code-block:: c
+
+ int f()
+ {
+ return 1;
+ }
+
+.. code-block:: none
+
+ 00000000 <f>:
+ 0: e3a00001 mov r0, #1
+ 4: e12fff1e bx lr
+
+Constants which don't fit into 12-bit window on arm will be loaded from memory
+or constructed with 2 loads:
+
+.. code-block:: none
+
+ 00000000 <f>:
+ 0: e59f0000 ldr r0, [pc] ; 8 <f+0x8>
+ 4: e12fff1e bx lr
+ 8: 00000801 .word 0x00000801 ; <=== 2049
+
+After settling on flags/constants push often used values together bitwise.
On Thu, Mar 05, 2020 at 10:02:53PM +0300, Alexey Dobriyan wrote:
> I wonder if it would be useful to have something like this in tree.
>
> It states trivial things for anyone who looked at disassembly few times
> but still...
It's a bit x86-specific, and it ignores things which matter more like
paying attention to cacheline boundaries in structures. I'm also not
convinced that those who come after us in ten years and have to widen
everything from int to long will thank us for saving a few bytes of
Icache.
> +.. code-block:: c
> +
> + int g(int, int, flag_t);
> + int f(int a, int b)
> + {
> + return g(a, b, FLAG_C);
> + }
> +
> +Appending an argument at the end adds minimum amount of code:
> +
> +.. code-block:: none
> +
> + f:
> + mov edx, FLAG_C
> + jmp g
> +
> +Appending an argument in the middle or in the beginning will generate
> +reshuffle sequence:
> +
> +.. code-block:: none
> +
> + f:
> + mov edx, esi
> + mov esi, edi
> + mov edi, FLAG_C
> + jmp g
But if f is static inline, then it makes no difference at all.
> +Constants which don't fit into 12-bit window on arm will be loaded from memory
> +or constructed with 2 loads:
The 12-bit window on ARM is 8 bits rotated by a power of 4. So for example,
0xc000003f is a valid constant, but 0x000001f7 is not. I don't know what
arm64 does for constants.
On Thu, 5 Mar 2020 22:02:53 +0300
Alexey Dobriyan <[email protected]> wrote:
> I wonder if it would be useful to have something like this in tree.
>
> It states trivial things for anyone who looked at disassembly few times
> but still...
If nothing else, it needs an introductory paragraph saying what it's about
and why somebody might want to read it.
In general, I'm really reluctant to add stuff to the coding-style
documents (and that's pretty much what this is) without a pretty strong
feeling that it describes generally accepted practices. So I'd definitely
want a fair number of acks before taking something like this.
Thanks,
jon
On Thu, 05 Mar 2020, Alexey Dobriyan <[email protected]> wrote:
> +1) Generic techniques
> +---------------------
> +
> +### a) Inlining/uninlining function calls ###
What others said about the content is more important, but if you end up
doing this, please drop the 1) and a) stuff in the headings. It's just
manual work someone has to update for no real gain.
BR,
Jani.
--
Jani Nikula, Intel Open Source Graphics Center