Todos estos pasos descriptos fueron probados en ambientes productivos

miércoles, 21 de septiembre de 2016

Problem with LDMD  daemon and the solution (spanish version)

In this article , we describe with my collegue  Nicolas Morono,  a bug with ldmd daemon and how to restore the previous configuration of the Logical Domains  ( LDOMs ) using ldm-db.xml file

When we wanted assign a lun to a LDOM, we find with this trouble :

# ldm list
Failed to connect to logical domain manager: Connection refused

We check and the service ldmd is in maintenance state
svcs -xv
svc:/ldoms/ldmd:default (Logical Domains Manager)
State: maintenance since June 2, 2016 06:36:16 PM ART
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
See: /var/svc/log/ldoms-ldmd:default.log
Impact: This service is not running.

In the  /var/adm/messages it showed this errors

Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 652011 daemon.warning] svc:/ldoms/ldmd:default: Method "/opt/SUNWldm/bin/ldmd_start" failed with exit status 95.
Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 748625 daemon.error] ldoms/ldmd:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Jun 2 18:36:16 m5-1-pdom2 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
Jun 2 18:36:16 m5-1-pdom2 EVENT-TIME: Thu Jun 2 18:36:16 ART 2016
Jun 2 18:36:16 m5-1-pdom2 PLATFORM: SPARC-M5-32, CSN: AK00xx8x1, HOSTNAME: m5-1-pdom2
Jun 2 18:36:16 m5-1-pdom2 SOURCE: software-diagnosis, REV: 0.1
Jun 2 18:36:16 m5-1-pdom2 EVENT-ID: 889f64a0-0102-efd6-997f-8e83e7fba09a
Jun 2 18:36:16 m5-1-pdom2 DESC: A service failed - a start, stop or refresh method failed.
Jun 2 18:36:16 m5-1-pdom2 AUTO-RESPONSE: The service has been placed into the maintenance state.
Jun 2 18:36:16 m5-1-pdom2 IMPACT: svc:/ldoms/ldmd:default is unavailable.
Jun 2 18:36:16 m5-1-pdom2 REC-ACTION: Run 'svcs -xv svc:/ldoms/ldmd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at for the latest service procedures and policies regarding this diagnosis.
Jun 2 18:40:28 m5-1-pdom2 cmlb: [ID 107833 1

We check in the svc logs  

cat /var/svc/log/ldoms-ldmd:default.log
Jun 02 18:35:16 timeout waiting for op HVctl_op_get_bulk_res_stat
Jun 02 18:35:16 fatal error: waiting for hv response timeout

[ Jun 2 18:35:16 Stopping because process dumped core. ]
[ Jun 2 18:35:16 Executing stop method (:kill). ]
[ Jun 2 18:35:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Jun 02 18:36:16 timeout waiting for op HVctl_op_hello
Jun 02 18:36:16 fatal error: waiting for hv response timeout

[ Jun 2 18:36:16 Method "start" exited with status 95. ]

We looked at the oracle docs and came to the conclusion that there was a  bug  in firmware versions below  1. 14.2  which matched our environment.
We opened a service request to confirm the analyzed by us and the proposed solution was the same.

The bug is in Hypervisors lower than the version 1.14.2 .

- The short term solution is to perform a power-cycle the system.
- The solution to medium / long term is to update the system firmware to a recent version ( HypV 1.14.2 or Higher )

At this point we find that solutions involve a power cycle that involves all running LDOMS and total reboot of the machine.
We decided to perform the firmware upgrade and make the power-cycle, but we realized that the last saved settings LDOMS is old and we  will lose 6 months changes in LDOMs configurations ( like creation of new LDOMs , disk assignments, allocation of network cards, etc )

The solution applied to solved this situation was as follow :

Prior to reboot the PDOM, we backup the file  ldom-db.xml  located in  /var/opt/SUNWldm , ( this file make the Magic ) this file has all the settings that are active in PDOM regardless of whether or not you saved in the SP .
We copy this file ( ldom-db.xml ) in /usr/scripts , to use after easily without a restore from the backup

Here are the steps used 
From the ilom
We make the power-cycle 
stop Servers/PDomains/PDomain_2/HOST 
y then
start Servers/PDomains/PDomain_2/HOST

Once we Boot the PDOM and with the LDOMs down and  unbind,  we take a backup of the file ldom-db.xml  and disable the ldom service daemon.

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.2%  0.2%  8d 2h 38m
dnet1002         active     -n----  5002    8     8G       0.5%  0.5%  5d 2h 49m
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  8d 1h 34m
dsunt200         active     -n----  5001    48    40G      0.0%  0.0%  2m

root@ # ldm stop dsunt200
LDom dsunt200 stopped
root@ # ldm unbind dsunt200

root@ # ldm stop dsunt100
LDom dsunt100 stopped
root@ # ldm unbind dsunt100

root@ # ldm stop dnet1002
LDom dnet1002 stopped
root@ # ldm unbind dnet1002

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.5%  0.5%  8d 2h 40m
dnet1002         inactive     ------      8     8G       
dsunt100         inactive    ------      48    40G      
dsunt200         inactive   ------       48    40G
root@ #

cd /var/opt/SUNWldm
cp -p ldom-db.xml ldom-db.xml.orig
svcadm disable ldmd

##### Here we use the file stored previoulsy in /usr/scripts/,  Now we overwrite the original stored in  /var/opt/SUNWldm
cp -p /usr/scripts/ldom-db.xml /var/opt/SUNWldm/ldom-db.xml        

Enable the ldmd service.
svcadm enable ldmd

### We check the configuration to see if everythings is OK, bind and start of ldoms .
Then, we make an init 6 and after that .. bind and start to all ldoms like we show you next

root@ # ldm bind dsunt200
root@ # ldm start dsunt200
LDom dsunt200 started
root@ # ldm bind dsunt100
root@ # ldm start dsunt100
LDom dsunt100 started
root@ # ldm bind dnet1002
root@ # ldm start dnet1002
LDom dnet1002 started

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      3.7%  3.7%  8d 2h 55m
dnet1002         active     -n----  5002    8     8G       0.7%  0.7%  3s
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  2s
dsunt200         active     -n----  5001    48    40G      9.1%  1.0%  2s
root@ #

PS : Please forgive my english  ;-) 

5 comentarios:

  1. Respuestas
    1. Diego, hi. Good night. I work at CLARO S.A, in Brazil, Porto Alegre. I work with Solaris 10/11 and Oracle VM in SPARC enviroment. Do you have any workgroup or something to suggest that i can participate to learn more about? Thank you.

    2. Hola Lucas Como vai voce ? Can you send an email to . obrigado