In this article, written with my colleague Nicolas Morono, we describe a bug in the ldmd daemon and how to restore the previous configuration of the Logical Domains (LDOMs) using the ldom-db.xml file.
When we tried to assign a LUN to an LDOM, we ran into this problem:
# ldm list
Failed to connect to logical domain manager: Connection refused
We checked, and the ldmd service was in the maintenance state:
# svcs -xv
svc:/ldoms/ldmd:default (Logical Domains Manager)
State: maintenance since June 2, 2016 06:36:16 PM ART
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
See: /var/svc/log/ldoms-ldmd:default.log
Impact: This service is not running.
In /var/adm/messages we found these errors:
Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 652011 daemon.warning] svc:/ldoms/ldmd:default: Method "/opt/SUNWldm/bin/ldmd_start" failed with exit status 95.
Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 748625 daemon.error] ldoms/ldmd:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Jun 2 18:36:16 m5-1-pdom2 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
Jun 2 18:36:16 m5-1-pdom2 EVENT-TIME: Thu Jun 2 18:36:16 ART 2016
Jun 2 18:36:16 m5-1-pdom2 PLATFORM: SPARC-M5-32, CSN: AK00xx8x1, HOSTNAME: m5-1-pdom2
Jun 2 18:36:16 m5-1-pdom2 SOURCE: software-diagnosis, REV: 0.1
Jun 2 18:36:16 m5-1-pdom2 EVENT-ID: 889f64a0-0102-efd6-997f-8e83e7fba09a
Jun 2 18:36:16 m5-1-pdom2 DESC: A service failed - a start, stop or refresh method failed.
Jun 2 18:36:16 m5-1-pdom2 AUTO-RESPONSE: The service has been placed into the maintenance state.
Jun 2 18:36:16 m5-1-pdom2 IMPACT: svc:/ldoms/ldmd:default is unavailable.
Jun 2 18:36:16 m5-1-pdom2 REC-ACTION: Run 'svcs -xv svc:/ldoms/ldmd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://support.oracle.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.
We then checked the SMF log for the service:
# cat /var/svc/log/ldoms-ldmd:default.log
Jun 02 18:35:16 timeout waiting for op HVctl_op_get_bulk_res_stat
Jun 02 18:35:16 fatal error: waiting for hv response timeout
[ Jun 2 18:35:16 Stopping because process dumped core. ]
[ Jun 2 18:35:16 Executing stop method (:kill). ]
[ Jun 2 18:35:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Jun 02 18:36:16 timeout waiting for op HVctl_op_hello
Jun 02 18:36:16 fatal error: waiting for hv response timeout
[ Jun 2 18:36:16 Method "start" exited with status 95. ]
We searched the Oracle documentation and concluded that there is a bug in hypervisor firmware versions below 1.14.2, which matched our environment.
We opened a service request to confirm our analysis, and the proposed solution was the same:
The bug affects hypervisors older than version 1.14.2.
- The short-term workaround is to power-cycle the system.
- The medium/long-term solution is to update the system firmware to a recent version (hypervisor 1.14.2 or higher).
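As a side note (not part of the original procedure): once ldmd is healthy again, the hypervisor and system firmware versions can be checked directly from the control domain; they can also be read from the SP.
# Prints the Logical Domains Manager version and the hypervisor/system firmware version
ldm -V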
At this point we realized that both solutions involve a power cycle, which takes down all running LDOMs and means a full reboot of the machine.
We decided to perform the firmware upgrade and the power-cycle, but then we realized that the last configuration saved to the SP was old: we would lose six months of changes to the LDOM configuration (creation of new LDOMs, disk assignments, allocation of network cards, etc.).
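To see which configurations are stored on the SP and which one will be used at the next power-on (and therefore how out of date it is), something like the following can be run from the control domain; this check is illustrative and was not part of our original session.
# Lists the configurations saved on the service processor and marks the current / next-poweron one
ldm list-spconfig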
The solution we applied to get out of this situation was as follows:
Before rebooting the PDOM, we backed up the ldom-db.xml file located in /var/opt/SUNWldm (this file works the magic): it holds all the settings that are active on the PDOM, regardless of whether or not they have been saved to the SP.
We copied this file (ldom-db.xml) to /usr/scripts, so we could reuse it easily afterwards without restoring it from backup.
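The copy itself is a single command; a minimal sketch using the paths mentioned above:
# Before the power-cycle: keep a copy of the live LDOM configuration database outside /var/opt/SUNWldm
cp -p /var/opt/SUNWldm/ldom-db.xml /usr/scripts/ldom-db.xml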
Here are the steps we used.
From the ILOM, we performed the power-cycle:
stop Servers/PDomains/PDomain_2/HOST
and then
start Servers/PDomains/PDomain_2/HOST
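Optionally (we did not capture it in this session), the PDOM console can be attached from the ILOM to watch the host come back up; on this platform the target should look roughly like the following, assuming the same PDomain_2 path as above.
start /Servers/PDomains/PDomain_2/HOST/console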
Once the PDOM was booted, and with the LDOMs stopped and unbound, we took a backup of the ldom-db.xml file and disabled the ldmd service daemon.
root@ # ldm ls
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 8 16G 0.2% 0.2% 8d 2h 38m
dnet1002 active -n---- 5002 8 8G 0.5% 0.5% 5d 2h 49m
dsunt100 active -n---- 5000 48 40G 0.0% 0.0% 8d 1h 34m
dsunt200 active -n---- 5001 48 40G 0.0% 0.0% 2m
root@#
root@ # ldm stop dsunt200
LDom dsunt200 stopped
root@ # ldm unbind dsunt200
root@ # ldm stop dsunt100
LDom dsunt100 stopped
root@ # ldm unbind dsunt100
root@ # ldm stop dnet1002
LDom dnet1002 stopped
root@ # ldm unbind dnet1002
root@ # ldm ls
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 8 16G 0.5% 0.5% 8d 2h 40m
dnet1002 inactive ------ 8 8G
dsunt100 inactive ------ 48 40G
dsunt200 inactive ------ 48 40G
root@ #
cd /var/opt/SUNWldm
cp -p ldom-db.xml ldom-db.xml.orig
svcadm disable ldmd
##### Here we use the file previously stored in /usr/scripts; now we overwrite the original in /var/opt/SUNWldm
cp -p /usr/scripts/ldom-db.xml /var/opt/SUNWldm/ldom-db.xml
# Enable the ldmd service.
svcadm enable ldmd
### We check that the configuration is OK before binding and starting the LDOMs.
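Concretely, the check can be as simple as the following (a sketch; we only needed to confirm that the service was back online and that the restored domains were visible):
# Confirm that ldmd is online again
svcs ldmd
# The domains defined in the restored ldom-db.xml should now be listed
ldm list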
Then we did an init 6, and after that we bound and started all the LDOMs, as shown next:
root@ # ldm bind dsunt200
root@ # ldm start dsunt200
LDom dsunt200 started
root@ # ldm bind dsunt100
root@ # ldm start dsunt100
LDom dsunt100 started
root@ # ldm bind dnet1002
root@ # ldm start dnet1002
LDom dnet1002 started
root@ # ldm ls
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 8 16G 3.7% 3.7% 8d 2h 55m
dnet1002 active -n---- 5002 8 8G 0.7% 0.7% 3s
dsunt100 active -n---- 5000 48 40G 0.0% 0.0% 2s
dsunt200 active -n---- 5001 48 40G 9.1% 1.0% 2s
root@ #
PS: Please forgive my English ;-)