Martin Husemann
martin@NetBSD.org
In the end we were able to fix ...
and that "we" includes:
Special thanks to everyone involved (even if I forgot to mention you in the list above)!
The CubieTruck is a tiny board based on the Allwinner A20 SoC with lots of gadgets and useful peripherals on board.
I put it into an old SCSI enclosure, together with a SATA disk.
Initially quite a few drivers were missing, but as a few other developers received a CubieBoard or CubieTruck at the same time, this got fixed quickly.
With the help of others, I wrote the "awge" driver for the network part.
Now the only important missing parts are:
Otherwise the CUBIETRUCK support is pretty complete.
ARM has supported big endian CPUs for quite a while, but the old "big endian mode" had some quirks built into it that got in the way when they moved on to scalar features, SIMD, and in the end AArch64, the 64 bit ARM architecture.
So ARM dropped support for the old big endian mode, renamed the object format used to "BE32" and created a new object format, called "BE8".
Marketing required compatibility with old object files.
But the new cores required all instructions to be encoded in little endian byte order!
Luckily all ARM provided tools already marked code sections with special symbols.
So the "compatibility magic" was put into the linker, and the ABI was extended by a subset of the special symbols always used by ARM tools.
Simple idea: find all 32bit code parts and swap the instructions accordingly, find all 16bit thumb instruction parts and swap those.
$a, $a.1 .. $a.N
$t, $t.1 .. $t.N
$d, $d.1 .. $d.N
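As a minimal sketch of that idea (this is not the linker's code, and the helper names swap_arm_words and swap_thumb_halfwords are made up for illustration), the two kinds of swapping could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of every 32 bit ARM instruction in buf[0..len). */
static void
swap_arm_words(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i + 4 <= len; i += 4) {
		uint8_t t;

		t = buf[i];
		buf[i] = buf[i + 3];
		buf[i + 3] = t;
		t = buf[i + 1];
		buf[i + 1] = buf[i + 2];
		buf[i + 2] = t;
	}
}

/* Reverse the byte order of every 16 bit Thumb halfword in buf[0..len). */
static void
swap_thumb_halfwords(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i + 2 <= len; i += 2) {
		uint8_t t = buf[i];

		buf[i] = buf[i + 1];
		buf[i + 1] = t;
	}
}
```

For example, "bx lr" is the word 0xe12fff1e. A big endian object file stores it as the bytes e1 2f ff 1e, and after swap_arm_words() the bytes read 1e ff 2f e1, which is the little endian encoding the core fetches. Data regions are left untouched.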
After solving basic support issues and having the machine boot up to multiuser, various issues showed up:
To allow sending debug output to the serial port before attaching any drivers, a simple polled "early console" is set up.
This did not work: it printed garbage and caused a long delay before the kernel com driver attached and took over console output.
After noticing that the early console worked well for little endian kernels, the bug was easy to spot:
The three minimalistic functions used for polled console do direct hardware access (no bus_space abstraction involved), and those accesses did not provide byte swapping if needed.
After fixing, the code to wait for the com device to become ready to transmit a character looks like this:
    while ((le32toh(uart_base[com_lsr]) & LSR_TXRDY) == 0 && --timo > 0)
        ;
Adding a few le32toh() and htole32() calls, like in the example above, fixed the issue.
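To show the read side and the write side together, here is a small sketch of a polled "put character" routine. It is not the kernel code: the sk_ names are hypothetical, the register index and status bit are the usual 16550 values (assumed here), and sk_le32toh()/sk_htole32() are portable stand-ins for the real le32toh()/htole32() so the example also builds outside of NetBSD.

```c
#include <stdint.h>
#include <string.h>

#define SK_COM_DATA	0	/* transmit register index (assumed) */
#define SK_COM_LSR	5	/* line status register index (assumed) */
#define SK_LSR_TXRDY	0x20	/* transmitter ready bit (assumed) */

/* Interpret a value read from the device as little endian. */
static uint32_t
sk_le32toh(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	    (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* Lay out a host value as little endian before writing it to the device. */
static uint32_t
sk_htole32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff), (uint8_t)(v >> 24)
	};
	uint32_t r;

	memcpy(&r, b, sizeof(r));
	return r;
}

/* Returns 0 on success, -1 if the transmitter never became ready. */
static int
sk_early_putc(volatile uint32_t *uart_base, int c)
{
	int timo = 150000;

	while ((sk_le32toh(uart_base[SK_COM_LSR]) & SK_LSR_TXRDY) == 0 &&
	    --timo > 0)
		continue;
	if (timo == 0)
		return -1;
	uart_base[SK_COM_DATA] = sk_htole32((uint32_t)(c & 0xff));
	return 0;
}
```

On a little endian kernel both helpers are no-ops, which is why the missing swaps went unnoticed there.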
The MMC driver uses DMA to transfer data from the SD card to memory.
The setup for the DMA descriptors passed to the device needs to explicitly swap the data passed to the DMA engine into the endianness expected by the device - which is always little endian on this SoC.
A typical DMA descriptor consists of a few status/command bits and a target address for the operation. Imagine what goes wrong if the address is in opposite byte order: the engine will overwrite arbitrary memory.
Again, adding a few htole32() calls fixed the issue.
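A rough sketch of what such a descriptor setup looks like follows. The descriptor layout and all sk_ names are invented for illustration and do not match the real MMC controller; sk_htole32() again stands in for htole32().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout, read by the device as little endian. */
struct sk_dma_desc {
	uint32_t config;	/* status/command bits */
	uint32_t size;		/* transfer length in bytes */
	uint32_t addr;		/* bus address of the data buffer */
	uint32_t next;		/* bus address of the next descriptor */
};

/* Lay out a host value as little endian, whatever the host byte order. */
static uint32_t
sk_htole32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff), (uint8_t)(v >> 24)
	};
	uint32_t r;

	memcpy(&r, b, sizeof(r));
	return r;
}

/* Every field the DMA engine reads is converted, not just the address. */
static void
sk_fill_desc(struct sk_dma_desc *d, uint32_t config, uint32_t size,
    uint32_t addr, uint32_t next)
{
	d->config = sk_htole32(config);
	d->size = sk_htole32(size);
	d->addr = sk_htole32(addr);
	d->next = sk_htole32(next);
}
```

With the conversion in place, a buffer address of 0x40001000 ends up in memory as the bytes 00 10 00 40 on either kind of kernel. Without it, a big endian kernel would store 40 00 10 00 and the engine would see address 0x00100040.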
Loadable kernel module files are not finally linked (with the -be8 option), but are instead handled by the kernel object loader.
The loader had to be taught about the magic $a (and friends) symbols and do byte swapping post-load.
Luckily we have a proper machine dependent function called after symbol loading, usually only doing some data/instruction cache consistency flushing. Inserting a fix up call there looks like this:
    int
    kobj_machdep(kobj_t ko, void *base, size_t size, bool load)
    {

        if (load) {
    #if __ARMEB__
            if (CPU_IS_ARMV7_P())
                kobj_be8_fixup(ko);
    #endif
Then we need a simple function to categorize symbols:
    static enum be8_magic_sym_type
    be8_sym_type(const char *name, int info)
    {
        if (ELF_ST_BIND(info) != STB_LOCAL)
            return Other;
        if (ELF_ST_TYPE(info) != STT_NOTYPE)
            return Other;
        if (name[0] != '$' || name[1] == '\0' ||
            (name[2] != '\0' && name[2] != '.'))
            return Other;

        switch (name[1]) {
        case 'a':
            return ArmStart;
        case 'd':
            return DataStart;
        case 't':
            return ThumbStart;
        default:
            return Other;
        }
    }
The following code iterates over all symbols in the newly loaded module:
    /*
     * Count all special relocations symbols
     */
    ksyms_mod_foreach(ko->ko_name, be8_ksym_count, &relsym_cnt);
where relsym_cnt is:
    size_t relsym_cnt = 0;
and the callback function be8_ksym_count looks like:
    static int
    be8_ksym_count(const char *name, int symindex, void *value, uint32_t size,
        int info, void *cookie)
    {
        size_t *res = cookie;
        enum be8_magic_sym_type t = be8_sym_type(name, info);

        if (t != Other)
            (*res)++;
        return 0;
    }
After counting we allocate storage for all the relevant symbols, and run another iteration where each symbol, together with its type and address, is stored.
This array is then sorted by address (calling kheapsort()).
Finally we run through the array in ascending address order, and for all sections, depending on type of the symbol describing it, swap 32 bits, swap 16 bits, or do nothing.
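The sort-then-swap pass can be sketched in userland C as below. This is a simplified illustration, not the kernel function: it treats the module as one flat image, uses qsort() where the kernel calls kheapsort(), and all sk_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum sk_type { SK_ARM, SK_THUMB, SK_DATA };

struct sk_marker {
	size_t off;		/* offset of the mapping symbol in the image */
	enum sk_type type;	/* what kind of region starts there */
};

static int
sk_marker_cmp(const void *a, const void *b)
{
	const struct sk_marker *ma = a, *mb = b;

	return (ma->off > mb->off) - (ma->off < mb->off);
}

/* Reverse the bytes of every unit-sized item in p[0..len). */
static void
sk_swap_range(uint8_t *p, size_t len, size_t unit)
{
	for (size_t i = 0; i + unit <= len; i += unit) {
		for (size_t lo = 0, hi = unit - 1; lo < hi; lo++, hi--) {
			uint8_t t = p[i + lo];

			p[i + lo] = p[i + hi];
			p[i + hi] = t;
		}
	}
}

/* Each region runs from its marker to the next marker (or the image end). */
static void
sk_be8_fixup(uint8_t *image, size_t image_len, struct sk_marker *m, size_t n)
{
	qsort(m, n, sizeof(*m), sk_marker_cmp);
	for (size_t i = 0; i < n; i++) {
		size_t start = m[i].off;
		size_t end = (i + 1 < n) ? m[i + 1].off : image_len;

		if (start >= image_len)
			break;
		if (end > image_len)
			end = image_len;
		switch (m[i].type) {
		case SK_ARM:
			sk_swap_range(image + start, end - start, 4);
			break;
		case SK_THUMB:
			sk_swap_range(image + start, end - start, 2);
			break;
		case SK_DATA:
			break;	/* data stays big endian */
		}
	}
}
```

Sorting first is what makes the region boundaries cheap to find: once the markers are in address order, the end of each region is simply the start of the next one.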
kobj_machdep() will flush caches after we are done.
In a standard build, NetBSD does not use the gcc provided build infrastructure to build libgcc, instead all "configury" is done upfront during a step called "mknative", and the resulting makefile fragments and header files are then committed to the NetBSD tree.
To make sure the symbols in libgcc are all created with visibility "hidden", some tricks are played on the intermediate object files that used to include a "strip" and a "ld -r" invocation.
Now the resulting libgcc, while still being a linkable object, had all $a, $t and $d symbols stripped.
But: when finally linking executables, the linker is invoked with the -be8 option and tries to do the magic byte swapping for thumb and arm instructions.
If during this link it pulls in some function from libgcc, the swapping will not work for that function, as we stripped the "unused" local symbols in the libgcc pre-linking step.
The first program affected during a multi-user boot was fc-cache (after installation or X/font related updates); it uses __popcountsi2 on armv7.
The obvious fix: do not strip.
But other symbols had to be removed, as we would otherwise get duplicate symbols (it is a mess, better not to ask).
So strip got replaced by a slightly more magic objcopy invocation that removed all local symbols but left the special ones in place.
Gdb worked on core files, but not when trying to start programs from within it using the "run" command.
Trying to "run" anything caused weird failures inside ld.elf_so, which led me to make an "educated guess" that was spot-on.
When gdb tries to run a child process, it inserts a breakpoint in a known-to-gdb dummy function in the dynamic loader. On NetBSD this function is (in ARM assembly):
0x21e8 <_rtld_debug_state>: bx lr
(in C it is just an empty function with some magic to prevent the compiler from optimizing it too much)
This breakpoint is hit whenever a new shared object is loaded by ld.elf_so (and the main program binary is the first one to trigger). Gdb then extracts all necessary information about the new shared library (or main module) and continues from the breakpoint.
On ARM, a gdb breakpoint is done by temporarily replacing the instruction with a special illegal instruction and then trapping the SIGILL via ptrace. The replacement instruction is coded as an array of bytes in gdb - and there is a little endian and a big endian variant.
Code inspection showed that Gdb upstream had added a new member "byte_order_for_code" in the struct "gdbarch_info", but the NetBSD specific breakpoint instruction was still selected by the old "byte_order" member. Upstream had fixed it for all other ARM targets, but somehow the NetBSD specific code was not in sync and did not get updated.
This should not happen at various levels, but...
So gdb selected the wrong-endian encoding for the breakpoint illegal instruction, and instead of causing a trap, this just did some random arithmetic and continued in whatever function happened to live in ld.elf_so next to it:
    0x21e8 <_rtld_debug_state>:       smlattne r0, r6, r0, r0
    0x21ec <_rtld_objlist_clear>:     mov r12, sp
    0x21f0 <_rtld_objlist_clear+4>:   push {r3, r4, r11, r12, lr, pc}
(smlattne = signed multiply accumulate, top half * top half, conditional if "not equal")
This corrupted internal ld.elf_so data structures immediately before returning to ld.elf_so internal fixup work.
The obvious fix: select the breakpoint according to the new member (and sync better with upstream).
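The selection logic can be illustrated with a small standalone sketch. This is not gdb source: the sk_ names are hypothetical, and the breakpoint word 0xe6000011 is an assumption on my part, chosen because its byte-swapped form 0x110000e6 decodes to exactly the smlattne instruction shown above.

```c
#include <stdint.h>

enum sk_byte_order { SK_LITTLE, SK_BIG };

/* The assumed breakpoint word 0xe6000011 in both byte orders. */
static const uint8_t sk_le_breakpoint[4] = { 0x11, 0x00, 0x00, 0xe6 };
static const uint8_t sk_be_breakpoint[4] = { 0xe6, 0x00, 0x00, 0x11 };

struct sk_arch_info {
	enum sk_byte_order byte_order;		/* endianness of data */
	enum sk_byte_order byte_order_for_code;	/* endianness of code */
};

/* The fix: look at the code byte order, not the data byte order. */
static const uint8_t *
sk_select_breakpoint(const struct sk_arch_info *info)
{
	return info->byte_order_for_code == SK_BIG ?
	    sk_be_breakpoint : sk_le_breakpoint;
}

/* What a BE8 core sees when it fetches an instruction from these bytes. */
static uint32_t
sk_fetch_le(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
```

For a BE8 process, byte_order is big while byte_order_for_code is little. Selecting by byte_order picks sk_be_breakpoint, which the core fetches as 0x110000e6 instead of the intended 0xe6000011.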
We found a few issues that were unrelated to endianness, and especially the latter two would have shown up on similar tests with a little endian system as well:
NetBSD offers the standard Itanium interface for unwinding stacks (but not the convoluted HP version of the API).
The implementation is derived from LLVM's Compiler-RT, but it had a bug that got triggered during the ATF (Automated Testing Framework) internal tests.
ATF is written in C++ and heavily relies on exceptions, e.g. to abort a failing test case.
One of the steps in exception unwinding is to identify the call frame from the current %pc value and then use the corresponding Call Frame Information stored by the compiler to unwind the stack properly and find the parent call frame.
Code details were just slightly different on BE8 to trigger a bug in the binary search to identify the relevant CFI entry.
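For readers who have not seen this kind of lookup, here is a sketch of a binary search that maps a pc value to the table entry covering it. It is neither the unwinder's code nor a reconstruction of the actual bug; the table layout and the sk_ names are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table entry: one per function with unwind info. */
struct sk_fde_entry {
	uintptr_t start;	/* first pc covered by this entry */
	uintptr_t len;		/* number of bytes covered */
};

/*
 * tab[] must be sorted by start address. Returns the entry whose range
 * contains pc, or NULL if pc falls into a gap or outside the table.
 */
static const struct sk_fde_entry *
sk_find_fde(const struct sk_fde_entry *tab, size_t n, uintptr_t pc)
{
	size_t lo = 0, hi = n;	/* the search window is [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].start <= pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* lo is now the index of the first entry starting after pc. */
	if (lo == 0)
		return NULL;
	if (pc - tab[lo - 1].start >= tab[lo - 1].len)
		return NULL;
	return &tab[lo - 1];
}
```

The edge cases (a pc exactly at the start of an entry, at the last byte of an entry, before the first entry, or in a gap between entries) are where such searches tend to go wrong, and a slightly different code layout is enough to hit one of them.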
The remaining issues could be called "false positives" - test failures that got fixed by fixing the tests.
The NEON FPU in Cortex-A7 does not support raising IEEE exceptions.
Userland can detect this by setting FP_X_INV in the exception mask and reading it back: on Cortex NEON it will not "stick".
So tests grew code like this:
    #elif defined(__arm__) && !__SOFTFP__
        /*
         * Some NEON fpus do not implement IEEE exception handling,
         * skip these tests if running on them and compiled for
         * hard float.
         */
        if (0 == fpsetmask(fpsetmask(FP_X_INV)))
            atf_tc_skip("FPU does not implement exception handling");
    #endif
ARM CPUs after version 5 can do unaligned data access.
Some of the signal tests explicitly try to trigger a SIGBUS for this, and failed on these CPUs.
Other architectures already provide a sysctl, sometimes even writable, to control/detect this behavior, so this was added to ARM as well and tests grew code like:
    #if defined(__alpha__) || defined(__arm__)
        int rv, val;
        size_t len = sizeof(val);
        rv = sysctlbyname("machdep.unaligned_sigbus", &val, &len, NULL, 0);
        ATF_REQUIRE(rv == 0);
        if (val == 0)
            atf_tc_skip("No SIGBUS signal for unaligned accesses");
    #endif
In retrospect, fewer issues were found and fixed than expected. Most of the time was spent on typical problems when bringing up new hardware.
After basic testing works, the automatic tests (via ATF) proved very valuable again, but some issues did not get noticed - so there is further room for improvement.
Next challenge: Firefox on BE8 ARM.