ARM’d and dangerous: FreeBSD on Cavium ThunderX (aarch64)

While I don’t remember for how many years I’ve had an interest in CPU architectures that could be an alternative to AMD64, I know pretty well when I started proposing to test 64-bit ARM at work. It was shortly after the disaster named Spectre / Meltdown that I first dug out server-class ARM hardware and asked whether we should get one such server and run some tests with it.

While the answer wasn’t a clear “no” it also wasn’t exactly “yes”. I tried again a few times over the course of 2018 and each time I presented some more points why I thought it might be a good thing to test this. But still I wasn’t able to get a positive answer. Finally in January 2019 I got a definitive answer – and it was “yes, go ahead”! The fact that Amazon had just presented their Graviton ARM Processor may have helped the decision.

Getting my hands on an ARM64 server

Eventually a Gigabyte R120-T32 was ordered and arrived at our Data Center in early February. It wasn’t perfect timing at all since there are quite a few important projects running right now which draw just about all the resources that we have. Still we put together a draft of an evaluation sheet with quite a few things to test. There are a lot of “yes / no” things and some that require performance measurements.

We set aside a few hours to put some drives into the machine, put it into the rack, configure the EFI and get BMC/IPMI working. We quickly found that there was no firmware update or anything available, so we should be good to go. However… The IPMI console (Java…) is not working at all. A few colleagues that have different Linux distros running on their workstations tried to access it – but no luck. It looks like a problem with too new Java versions (*sigh*). I didn’t even try it with my BSD workstation: I could never get Supermicro IPMIs to work – even on AMD64. So for me it’s Linux in Bhyve for that kind of task (if anybody has any idea on how to get IPMI console access working on FreeBSD please let me know!!).

Our common installation procedure for special servers involves a little device that stores CD/DVD images and presents a virtual optical drive to the machine. Since ARM64 machines usually don’t have optical drives anymore, there are no CD images available for that architecture. Fortunately it’s not that complicated to prepare a USB thumb drive instead.
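Preparing the stick boils down to copying the raw memstick image onto the device with dd. Here is a minimal sketch of that step; the function name write_image is made up for illustration, and both the image file name and the device node in the usage note are assumptions that will differ on your system.

```shell
#!/bin/sh
# write_image IMG DEV: copy a raw memstick image onto a block device.
# Both arguments are placeholders. Double-check DEV: dd overwrites it
# without asking.
write_image() {
    img=$1
    dev=$2
    [ -r "$img" ] || { echo "cannot read $img" >&2; return 1; }
    # Block size of 1 MiB, given in bytes so that both BSD and GNU dd accept it.
    dd if="$img" of="$dev" bs=1048576
}
```

On a FreeBSD workstation where the stick shows up as da0 (an assumption, check dmesg first), a call might look like: write_image FreeBSD-11.2-RELEASE-arm64-aarch64-memstick.img /dev/da0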

Once that was done I was able to confirm that remote administration using SoL (“Serial over LAN”) worked, so I could start playing around with the new server in my free time.

Hello Beastie!

Being our company’s “BSD guy”, it was pretty obvious that testing FreeBSD on the machine (even if it’s only for comparison reasons with Linux) would be my natural task. FreeBSD on ARM64 is currently a tier 2 platform. However it has been the plan for several years now to eventually make it tier 1 – i.e. a perfectly well supported platform. Even better: The Cavium ThunderX was selected as the reference platform for ARM64 and a partnership established between Cavium and the FreeBSD Foundation! So doing a few tests there should be a piece of cake, right?

I booted the machine with the USB stick attached. Just as expected, there’s no working VGA console support (this is pretty much the default when working with ARM). The SoL connection worked, though, and showed the nice beastie menu! “Off to a good start”, I thought. I was wrong.

The loader did its job of loading the kernel and… that’s where everything stopped. I rebooted again and again, trying to set loader tunables to get the system to boot (need I mention that the machine literally takes minutes to execute EFI stuff before it gets to the loader?). No luck whatsoever. The line “Entered …. EfiSelReportStatusCode Value: 3101019” was always the last thing printed to the screen. Even after waiting for several minutes, the machine didn’t do anything further.

“Maybe that status code shows that something is wrong?” I thought. A quick search on the net revealed a console log for NetBSD booting on the same hardware. It printed the same line but the kernel obviously did its thing. Weird – and quite discouraging. At this point I had no more ideas that I could try and was about to accept that it “just didn’t work”. Then I read a post to the FreeBSD-arm mailing list: Somebody else had the same problem! However it looked like nobody had a solution – there was not a single answer to that mail… But the post included one important piece of information: FreeBSD 11.2 did work!

FreeBSD 11.2

Ok, that would be a lot better than nothing. So I downloaded the 11.2 image, dd’d it to the stick and booted. This time there was no beastie (before 12.0 FreeBSD used a different loader on ARM) – but the kernel booted! I even got to the installer, Yay!

I did a fairly simple install like I often do it. The big difference was that I couldn’t configure the network. But let’s leave that for later, shall we? The installation completed and I rebooted. The loader appeared, the kernel booted and… Bam! The system is unable to import the pool! Too bad…

So I booted the stick again, did a second install choosing UFS instead of ZFS. This time FreeBSD would boot successfully into the new system and allow me to log in. Unfortunately no NICs were detected – and a server without network connectivity is rather… boring. What now?

After a bit of research on the net I learned about LIQUIDIO(4), the “Cavium 10Gb/25Gb Ethernet driver”. Sounded great. However the manpage says that it “first appeared in FreeBSD 12.0”. *sigh* Looks like I have to upgrade to 12.0 somehow.

FreeBSD 12?

I downloaded the source for the latest 12.0 from the releng branch, put it on a stick and attached it to the ARM server. That alone – I didn’t even get to import the pool – was enough to make the system panic! Oh great…

A UFS formatted stick then? Yes, that worked. I transferred the source, built the system, installed the kernel and saw that familiar line: “Entered …. EfiSelReportStatusCode Value: 3101019”. Obviously it’s not the images that are broken but actually FreeBSD 12.0 is…

What’s next? Let’s try 12-STABLE. Maybe the problem is fixed there. Unless… The installed kernel is broken and trying to boot kernel.old does not work! Oh excellent, it looks like “make installkernel” did not create any backup! This probably makes a lot of sense on small ARM devices, but it doesn’t make any in this case. Now I’m in for another reinstall.

After reinstalling 11.2 I did “cp -R /boot/kernel /boot/kernel.11” – just in case. Then I transferred the code for 12-STABLE onto the machine and built it. You would probably have guessed it by now: It’s broken.
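Since /boot/kernel is a directory holding the kernel and its modules, the backup has to be a recursive copy. A small sketch of a slightly more careful version of that safety net follows; the helper name backup_kernel is invented for this post and the default paths are simply the ones used above.

```shell
#!/bin/sh
# backup_kernel [SRC] [DST]: keep a copy of a kernel directory before
# running "make installkernel". Defaults match the paths used in the text.
backup_kernel() {
    src=${1:-/boot/kernel}
    dst=${2:-/boot/kernel.11}
    [ -d "$src" ] || { echo "no kernel directory at $src" >&2; return 1; }
    # Refuse to clobber an existing backup, it may be the only good kernel.
    [ ! -e "$dst" ] || { echo "$dst already exists, not overwriting" >&2; return 1; }
    # -R copies the whole directory tree, -p preserves modes and timestamps.
    cp -Rp "$src" "$dst"
}
```

Refusing to overwrite an existing target is a deliberate choice here: after a few rounds of rebuilding it is easy to copy a broken kernel over the last working one.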

HEADaches

The next day at work I organized a USB to ethernet adapter and configured ue0 so that I could finally ssh into the OS. That felt a lot better and I could also use svnlite directly to check out the source code. Quite some time later I had built and installed -CURRENT. But would it work? Nope!

Now this was all very unfortunate. But I happen to like FreeBSD and really want it to work on ARM64, too. Just waiting for somebody to fix it didn’t sound like a good strategy, however. The hardware is still pretty exotic and if just one person reported breakage in the final days of RCs for 12.0, it’s pretty clear that not too many developers have access to it to test things regularly… So I figured that I should try to find out exactly when HEAD broke. That would most likely be helpful information.

Alright… r302409 was the commit that turned HEAD into 12-CURRENT. Let’s make sure that this version still works – otherwise it would mean that 11.0 on ThunderX was fixed after branching off HEAD. Seems very unlikely, but making sure wouldn’t hurt, right?

usr/local/aarch64-freebsd/bin/ranlib -D libc_pic.a
--- libc.so.7.full ---
/usr/local/aarch64-freebsd/bin/ld: getutxent.So(.debug_info+0x3c): R_AARCH64_ABS64 used with TLS symbol udb
/usr/local/aarch64-freebsd/bin/ld: getutxent.So(.debug_info+0x59): R_AARCH64_ABS64 used with TLS symbol uf
/usr/local/aarch64-freebsd/bin/ld: utxdb.So(.debug_info+0x5c): R_AARCH64_ABS64 used with TLS symbol futx_to_utx.ut
/usr/local/aarch64-freebsd/bin/ld: jemalloc_tsd.So(.debug_info+0x3d): R_AARCH64_ABS64 used with TLS symbol __je_tsd_tls
/usr/local/aarch64-freebsd/bin/ld: jemalloc_tsd.So(.debug_info+0x1434): R_AARCH64_ABS64 used with TLS symbol __je_tsd_initialized
/usr/local/aarch64-freebsd/bin/ld: xlocale.So(.debug_info+0x404): R_AARCH64_ABS64 used with TLS symbol __thread_locale
/usr/local/aarch64-freebsd/bin/ld: setrunelocale.So(.debug_info+0x3d): R_AARCH64_ABS64 used with TLS symbol _ThreadRuneLocale
cc: error: linker command failed with exit code 1 (use -v to see invocation)
*** [libc.so.7.full] Error code 1

Perhaps building an older version of FreeBSD on a newer one is not such a good idea… So I chose to install 11.0 first. That worked without problems. However… 11.0 was cut before the switch to LLVM 4.0 – which brought in a usable lld! For that reason 11.0 needed an external toolchain. Of course that version is not supported anymore so there are no packages for it around. To really build anything on 11.0 would mean that I’d first have to cross-compile a binutils package for it! I admit that this wasn’t terribly attractive to me – especially since I already had a way different goal.

Ok… So HEAD likely broke somewhere between r302409 and r343815 – that’s roughly 40,000 commits to check! Let’s just hope that it broke after FreeBSD on ARM64 became self-hosting…
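Forty thousand commits sounds worse than it is, because each test build can rule out half of the remaining range. A quick back-of-the-envelope helper shows the worst case; bisect_steps is a name made up for this sketch and it assumes that every revision in between can actually be built and tested.

```shell
#!/bin/sh
# bisect_steps GOOD BAD: worst-case number of test builds a binary search
# needs to isolate the first bad revision between a known-good revision
# and a known-bad one.
bisect_steps() {
    span=$(( $2 - $1 ))
    steps=0
    # Every build halves the remaining range (rounded up in the worst case).
    while [ "$span" -gt 1 ]; do
        span=$(( (span + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}
```

For the range above, bisect_steps 302409 343815 prints 16, so in theory 16 builds are enough. Revisions that fail to compile add to that in practice.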

Checking r330000 – it works! r340000? Broken. r335000: Still works. r337500: Nope. After a lot of deleting previous builds and source, checking out other old revisions and rebuilding kernel-toolchain and kernel (and hitting some versions that didn’t build on ARM in the first place), I found out that it was commit r336520 that broke FreeBSD on ThunderX: The vt_efifb device was added to the GENERIC kernel on ARM64. According to the commit, it was tested on PINE64 but obviously it does not work with ThunderX, yet.
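What I did by hand is a plain binary search over revision numbers. Sketched as a shell function it could look like the following; bisect_rev and the test command passed to it are hypothetical, and in reality the "test" was a full cycle of checkout, build, install, reboot and watching the serial console.

```shell
#!/bin/sh
# bisect_rev GOOD BAD TEST_CMD: binary search for the first bad revision.
# GOOD must be a revision known to work, BAD one known to be broken.
# TEST_CMD is the name of a command or function that takes a revision
# number as its only argument and returns 0 if that revision boots.
# Note: plain sh has no local variables, so TEST_CMD must not modify
# the variables good, bad or mid.
bisect_rev() {
    good=$1
    bad=$2
    test_cmd=$3
    while [ $(( bad - good )) -gt 1 ]; do
        mid=$(( good + (bad - good) / 2 ))
        if "$test_cmd" "$mid"; then
            good=$mid   # mid works: the breakage came later
        else
            bad=$mid    # mid is broken: the breakage is here or earlier
        fi
    done
    # good and bad are now adjacent, so bad is the first broken revision.
    echo "$bad"
}
```

This sketch has no way of skipping revisions that do not build at all; when I hit one of those I simply picked a nearby revision by hand and continued.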

Problem solved?

So if it’s just one line in the GENERIC configuration it should not be any problem to fix this, right? To test I checked out HEAD again, removed the line and rebuilt the kernel. And after rebuilding and installing once more – it booted just fine!

Then I restored the GENERIC config and wrote a custom kernel configuration instead:

/usr/src/sys/arm64/conf/THUNDERX:

include GENERIC
ident THUNDERX

nodevice        vt_efifb

I wrote to the mailing list that I had identified a problem with ThunderX and made proposals on how to unbreak HEAD.

An answer pointed out that a custom kernel was not even required: It suffices to set the loader tunable hw.syscons.disable to make HEAD work again! No need to remove the device from GENERIC. That’s even better! But what I actually wanted was 12.0 or 12-STABLE and not HEAD. Installing that however, I found that it’s… well. Broken!
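To make such a tunable survive reboots it can go into /boot/loader.conf. The following is a small sketch of a helper that appends it only if it is not already set; the function name set_loader_tunable is invented for this example, and the value "1" in the usage note is my assumption, since the mailing list answer only named the tunable.

```shell
#!/bin/sh
# set_loader_tunable NAME VALUE [FILE]: add NAME="VALUE" to a
# loader.conf-style file unless NAME is already set there.
set_loader_tunable() {
    name=$1
    value=$2
    conf=${3:-/boot/loader.conf}
    # awk compares the key literally, so the dots in tunable names are
    # not treated as regex wildcards.
    if [ -f "$conf" ] && awk -F= -v n="$name" \
        '$1 == n { found = 1 } END { exit !found }' "$conf"; then
        return 0
    fi
    printf '%s="%s"\n' "$name" "$value" >> "$conf"
}
```

Called as set_loader_tunable hw.syscons.disable 1, this leaves a line hw.syscons.disable="1" in /boot/loader.conf and does nothing on a second run.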

It broke twice!

What’s going on here? Could it be that HEAD broke twice and the other problem was fixed in a later revision? I had no better explanation. So let’s get compiling for another few nights again and see…

r339436 is when HEAD was renamed to 13-CURRENT. Would it work just setting the loader tunable? Nope. I played the same game as before to eventually figure out that r338537 was the offending commit here. Its commit message says that it’s increasing two values because of ThunderX – but unfortunately that change is what actually broke the platform.

On to more compiling… Finally here’s the other commit: r343764 changed those values again in preparation for ThunderX2 – and accidentally fixed ThunderX, too! Unfortunately that wasn’t until after 12.0 was branched off.

Again it’s a rather simple patch to make 12-STABLE work:

--- sys/arm/arm/physmem.c.orig  2019-02-17 08:47:05.675448000 +0100
+++ sys/arm/arm/physmem.c       2019-02-17 08:48:53.209050000 +0100
@@ -29,6 +29,7 @@
 #include <sys/cdefs.h>
 __FBSDID("$FreeBSD: stable/12/sys/arm/arm/physmem.c 341760 2018-12-09 06:46:53Z mmel $");

+#include "opt_acpi.h"
 #include "opt_ddb.h"

 /*
@@ -48,8 +49,13 @@
  * that can be allocated, or both, depending on the exclusion flags associated
  * with the region.
  */
+#ifdef DEV_ACPI
+#define MAX_HWCNT       32      /* ACPI needs more regions */
+#define MAX_EXCNT       32
+#else
 #define        MAX_HWCNT       16
 #define        MAX_EXCNT       16
+#endif

 #if defined(__arm__)
 #define        MAX_PHYS_ADDR   0xFFFFFFFFull

Fixing 12?

After I had a working 12 system, I wrote to the mailing list again, asking if the change could be MFC’d from r343764 into 12-STABLE.

Rod Grimes was nice enough to ping the developer who had made the commit. So far he hasn’t replied but I really hope that it’ll be possible to MFC the change. Because otherwise that would mean all of FreeBSD 12.x would remain broken – and that would definitely not be cool… If it is possible we’re likely to have a working 12.1 image when that version is out. 12.0 will remain broken, I guess.

Until we have a working version again, you can either install using 11.2 and build from (currently patched) source, or you can try a fixed image that I built (if you trust me). What I did to build was this:

  1. svnlite co svn://svn.freebsd.org/base/stable/12 /usr/src
  2. cd /usr/src
  3. patch -i patchfile
  4. make -j 49 buildworld
  5. make -j 49 buildkernel
  6. edit stand/defaults/loader.conf to set hw.syscons.disable
  7. cd release
  8. make memstick NOPKG=1 NOPORTS=1 NOSRC=1 NODOC=1 WITH_COMPRESSED_IMAGES=1

Conclusion

Tier 1? Obviously we’re not quite there yet. Booting off of ZFS does not work. There are no binary updates. Even with 12-STABLE I’ve been unable to get the NICs working – and you wouldn’t want to run a server with four 10Gbit NICs and one 40Gbit NIC using just a 1Gbit USB to ethernet adapter.

On the plus side, FreeBSD generally works and seems stable with UFS. Building from source using 49 threads works and so does building software from ports. I even installed Synth and ran the “build-everything” for over a day to put some serious load on the machine without any problems.

I’d like to at least get the network working, too, before that machine turns into a penguin for production. Maybe I can set aside a few more hours at night. If I end up with anything noteworthy, I’ll follow up on this post. Oh, and if anybody has any information on making it work, please let me know!