Monday, 31 July 2017

Solved, WebSphere eats Linux - or, Linux reboots when WAS JVM starts up

So I've been working an interesting "bug" over the past few days.

Long story short, we had a VM running Red Hat Enterprise Linux 6.6 (with kernel 2.6.32-696.3.1.el6.x86_64), hosting WebSphere Application Server 8.5.5.8 and BPM Standard 8.5.7.0.

BPM is installed as a single cluster, with all of the workload running in a single JVM in this instance.

Whilst I could happily start the Deployment Manager and Node Agent, when I started the actual BPM JVM, after a minute or two, the box would reboot.

This was 100% reproducible.

I spent a happy day last week debugging this, looking at class paths, dependent JARs, auto starting EAR files etc. but to no avail.

I even wondered whether the fact that we were running slightly older versions of WAS and BPM was pertinent, even though I've used them a million times before.

Interestingly, nothing showed up in dmesg, /var/log/messages or /var/log/kernel.log, and there were no obvious kernel panic messages therein.

Thankfully, with the help of a VMware SME, we did find this : -

tail -f /var/crash/127.0.0.1-2017-07-31-10:43:30

...
<2>kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1412!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/system/cpu/online
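If you end up trawling through a lot of crash output, something like the following might save some scrolling. It is only a rough sketch, assuming the log lines look like the excerpt above (an optional "<N>" log-level prefix followed by "kernel BUG at file:line!"); the function name find_kernel_bugs is my own invention and not part of any tool.

```python
import re

# Matches lines such as:
#   <2>kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1412!
# The "<N>" log-level prefix is treated as optional (an assumption).
BUG_RE = re.compile(r"^(?:<\d+>)?kernel BUG at (?P<path>\S+):(?P<line>\d+)!")


def find_kernel_bugs(lines):
    """Return a list of (source_path, line_number) for each 'kernel BUG at' line."""
    hits = []
    for text in lines:
        match = BUG_RE.match(text.strip())
        if match:
            hits.append((match.group("path"), int(match.group("line"))))
    return hits


# Example input copied from the crash output shown above.
sample = [
    "<2>kernel BUG at drivers/net/vmxnet3/vmxnet3_drv.c:1412!",
    "<4>invalid opcode: 0000 [#1] SMP",
    "<4>last sysfs file: /sys/devices/system/cpu/online",
]

for path, line in find_kernel_bugs(sample):
    print(f"{path}:{line}")
```

The source path in the hit is the useful bit: here it points straight at the vmxnet3 network driver rather than anything in WAS or BPM.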


which tied up with this: -


which says, in part : -

<snip>
This issue affects all virtual machines running on ESXi 6.5 host (with virtual hardware version 13), the guest will freeze randomly (sometimes several minutes after power on, and sometimes freezes several hours from boot).

I got this kernel panic log several times, possibly this issue was caused by VMXNET3.

All kernel newer than 4.8.x are affected with this issue, if I downgrade the kernel version back to 4.4.x, the VMs will work like a charm.

(Guest OS is CentOS 7.3 with kernel-ml)

And this issue doesn't happen while virtual hardware version 11 with ESXi 6.5, only happen on virtual hardware version 13 + ESXi 6.5.
</snip>

My friendly VMware SME spotted this, and suggested that we switch the virtual Network Interface Card (vNIC) to use a different driver - namely e1000.
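To confirm which driver a given interface is actually using, before and after such a change, a quick look at sysfs does the job. This is a minimal sketch, assuming a Linux guest where /sys/class/net/IFACE/device/driver is a symlink to the bound driver; the interface name eth0 and the helper names nic_driver and list_nic_drivers are illustrative assumptions.

```python
import os


def nic_driver(iface, sysfs_root="/sys/class/net"):
    """Return the kernel driver name bound to iface, or None if it has none."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        # The symlink target ends in the driver name, e.g. ".../drivers/vmxnet3".
        return os.path.basename(os.readlink(link))
    except OSError:
        # Interfaces such as lo have no backing device, so no driver link.
        return None


def list_nic_drivers(sysfs_root="/sys/class/net"):
    """Map every entry under sysfs_root to its driver name (or None)."""
    try:
        names = sorted(os.listdir(sysfs_root))
    except OSError:
        return {}
    return {name: nic_driver(name, sysfs_root) for name in names}


if __name__ == "__main__":
    for name, driver in list_nic_drivers().items():
        print(f"{name}: {driver}")
```

On the affected VM I would expect eth0 (or whatever the interface is called) to report vmxnet3 before the switch and e1000 afterwards.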

We did this and …. voila :-)

The BPM runtime starts happily and all is good.

This appears to be related to the specific version of VMware ESX (aka vSphere) and the fact that the Linux VM is newly created, using the most recent virtual hardware version.

So that's all good then ….
