In the words of Dr Cathy Ryan, "If you don't write it down, it never happened".
To paraphrase one of my clients, "Every day is a school day".
I do, I learn, I share
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
My blog is PERSONAL, and is a repository of the stuff that I learn, play with, enjoy and want to share.
If you follow one of my tips, your mileage MAY well vary - Here be dragons :-)
Monday, 31 July 2017
Solved, WebSphere eats Linux - or, Linux reboots when WAS JVM starts up
So I've been working an interesting "bug" over the past few days.
The long story short is that we had a VM running Red Hat Enterprise Linux 6.6 ( with kernel 2.6.32-696.3.1.el6.x86_64 ), hosting WebSphere Application Server 126.96.36.199 and BPM Standard 188.8.131.52.
BPM is installed as a single cluster, with all of the workload running in a single JVM - in this instance.
Whilst I could happily start the Deployment Manager and Node Agent, when I started the actual BPM JVM, after a minute or two, the box would reboot.
This was 100% reproducible.
I spent a happy day last week debugging this, looking at class paths, dependent JARs, auto-starting EAR files etc. but to no avail.
I even wondered whether the fact that we were running slightly older versions of WAS and BPM was pertinent, even though I've used them a million times before.
Interestingly, nothing showed up in dmesg, /var/log/messages or /var/log/kernel.log, and there were no obvious kernel panic messages therein.
Thankfully, with the help of a VMware SME, we did find this : -
This issue affects all virtual machines running on ESXi 6.5 host (with virtual hardware version 13), the guest will freeze randomly (sometimes several minutes after power on, and sometimes freezes several hours from boot).
I got this kernel panic log several times, possibly this issue was caused by VMXNET3.
All kernel newer than 4.8.x are affected with this issue, if I downgrade the kernel version back to 4.4.x, the VMs will work like a charm.
(Guest OS is CentOS 7.3 with kernel-ml)
And this issue doesn't happen while virtual hardware version 11 with ESXi 6.5, only happen on virtual hardware version 13 + ESXi 6.5.
My friendly VMware SME spotted this, and suggested that we switch the virtual Network Interface Card (vNIC) to use a different driver - namely e1000.
We did this and …. voila :-)
The BPM runtime starts happily and all is good.
This appears to be related to the specific version of VMware ESX ( aka vSphere ) and the fact that the Linux VM is newly created, using the most recent virtual hardware version.
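If you want to confirm which driver a Linux guest's network interface is actually bound to, before and after a vNIC change like the one above, here is a minimal Python sketch. It is not what we ran at the time; it simply assumes the usual sysfs layout, where /sys/class/net/<interface>/device/driver is a symlink to the bound kernel driver (for example vmxnet3 or e1000). The function name nic_driver and its sysfs_root parameter are my own inventions for illustration.

```python
import os


def nic_driver(iface, sysfs_root="/sys/class/net"):
    """Return the kernel driver name bound to iface, or None if unknown.

    Assumes the standard Linux sysfs layout, where
    <sysfs_root>/<iface>/device/driver is a symlink to the driver directory.
    Purely virtual interfaces such as lo have no such link.
    """
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        return None
    # The link target ends in the driver name, e.g. .../drivers/vmxnet3
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    root = "/sys/class/net"
    if os.path.isdir(root):
        for name in sorted(os.listdir(root)):
            print(name, nic_driver(name, root) or "(no driver link)")
    else:
        print("No", root, "here - is this a Linux box?")
```

On a guest still using the paravirtualised adapter I would expect to see vmxnet3 against the interface, and e1000 once the vNIC type has been switched, but as ever, your mileage may vary.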