
All the steps described here were tested in production environments

Tuesday, September 27, 2016

How to add a Fibre Channel card to a SPARC T7-1

In this brief guide we show how to install a Fibre Channel card for SAN connectivity; the procedure is valid for any PCI card.
The installation in this specific case is of one Sun Storage Dual 16 Gb Fibre Channel PCIe Universal HBA, QLogic.

To install or remove a Fibre Channel card, the server must be powered off and disconnected from electrical power.
To do this, we connect a serial cable to the SP of the T7.

Using minicom or PuTTY, we connect to the SP and run stop -f /System, then show /System to check that power_state is Off.
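
If the SP session is captured to a log, the power-state check can be scripted. The following is a minimal sketch, assuming the SP prints the property as a line of the form `power_state = Off` (an assumption about the output layout, not a capture from this server). The function name `power_state` and the sample text are hypothetical.

```shell
#!/bin/sh
# Sketch: extract the power_state value from captured `show /System` output.
# Assumption: the SP prints the property as "power_state = <value>".

# Reads SP output on stdin and prints the value of power_state.
power_state() {
    tr -d '\r' | awk -F' = ' '$1 ~ /power_state/ { print $2; exit }'
}

# Illustrative sample, not real output from the server in this post.
sample='/System
    Properties:
        power_state = Off'

state=$(printf '%s\n' "$sample" | power_state)
if [ "$state" = "Off" ]; then
    echo "System is powered off; safe to unplug the power cords."
else
    echo "power_state is '$state'; do not open the server yet."
fi
```

The `tr -d '\r'` step is there because serial captures often carry carriage returns that would otherwise break the string comparison.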

We disconnect the power cords from the power supplies.
Now we begin opening the server, as shown below.

To insert or remove a PCI card, you need to open the locking mechanism shown below.

In real life it looks like this:

Now we insert the Fibre Channel card.

Perfect! Now we refit the top cover, slide the server back to its original position in the rack while pressing the green latches on the sides, and reconnect the power cords to the power supplies, restoring power to the server.

We power on the server: we connect to the SP and run start /System and then start /HOST/console.

On power-up, the server takes several minutes to run POST with extended diagnostics.

Wednesday, September 21, 2016

Problem with the ldmd daemon and the solution (Spanish version)

In this article, my colleague Nicolas Morono and I describe a bug in the ldmd daemon and how to restore the previous configuration of the Logical Domains (LDOMs) using the ldom-db.xml file.

When we tried to assign a LUN to an LDOM, we ran into this problem:

# ldm list
Failed to connect to logical domain manager: Connection refused

We checked and found the ldmd service in the maintenance state:
# svcs -xv
svc:/ldoms/ldmd:default (Logical Domains Manager)
State: maintenance since June 2, 2016 06:36:16 PM ART
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
See: /var/svc/log/ldoms-ldmd:default.log
Impact: This service is not running.

/var/adm/messages showed these errors:

Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 652011 daemon.warning] svc:/ldoms/ldmd:default: Method "/opt/SUNWldm/bin/ldmd_start" failed with exit status 95.
Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 748625 daemon.error] ldoms/ldmd:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Jun 2 18:36:16 m5-1-pdom2 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
Jun 2 18:36:16 m5-1-pdom2 EVENT-TIME: Thu Jun 2 18:36:16 ART 2016
Jun 2 18:36:16 m5-1-pdom2 PLATFORM: SPARC-M5-32, CSN: AK00xx8x1, HOSTNAME: m5-1-pdom2
Jun 2 18:36:16 m5-1-pdom2 SOURCE: software-diagnosis, REV: 0.1
Jun 2 18:36:16 m5-1-pdom2 EVENT-ID: 889f64a0-0102-efd6-997f-8e83e7fba09a
Jun 2 18:36:16 m5-1-pdom2 DESC: A service failed - a start, stop or refresh method failed.
Jun 2 18:36:16 m5-1-pdom2 AUTO-RESPONSE: The service has been placed into the maintenance state.
Jun 2 18:36:16 m5-1-pdom2 IMPACT: svc:/ldoms/ldmd:default is unavailable.
Jun 2 18:36:16 m5-1-pdom2 REC-ACTION: Run 'svcs -xv svc:/ldoms/ldmd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at for the latest service procedures and policies regarding this diagnosis.
Jun 2 18:40:28 m5-1-pdom2 cmlb: [ID 107833 1

We checked the SMF service log:

# cat /var/svc/log/ldoms-ldmd:default.log
Jun 02 18:35:16 timeout waiting for op HVctl_op_get_bulk_res_stat
Jun 02 18:35:16 fatal error: waiting for hv response timeout

[ Jun 2 18:35:16 Stopping because process dumped core. ]
[ Jun 2 18:35:16 Executing stop method (:kill). ]
[ Jun 2 18:35:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Jun 02 18:36:16 timeout waiting for op HVctl_op_hello
Jun 02 18:36:16 fatal error: waiting for hv response timeout

[ Jun 2 18:36:16 Method "start" exited with status 95. ]

We searched the Oracle documentation and concluded that there was a bug in firmware versions below 1.14.2, which matched our environment.
We opened a service request to confirm our analysis, and the proposed solution was the same.

The bug is in Hypervisor versions lower than 1.14.2.

- The short-term solution is to power-cycle the system.
- The medium/long-term solution is to update the system firmware to a recent version (HypV 1.14.2 or higher).
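
Checking whether a given Hypervisor version falls below the fixed release needs a numeric comparison field by field, since a plain string comparison would rank 1.9 above 1.14. The sketch below is one way to do it in plain shell. The function name `version_lt` and the sample value in `fw` are hypothetical, and reading the real version from the system is not shown.

```shell
#!/bin/sh
# Sketch: compare dotted version strings numerically, field by field.

# Returns 0 (true) when version $1 is strictly lower than version $2.
# Only the first three fields are compared; missing fields count as 0.
version_lt() {
    for i in 1 2 3; do
        # Padding with ".0.0" guarantees that fields 1..3 always exist.
        fa=$(echo "$1.0.0" | cut -d. -f"$i")
        fb=$(echo "$2.0.0" | cut -d. -f"$i")
        if [ "$fa" -lt "$fb" ]; then return 0; fi
        if [ "$fa" -gt "$fb" ]; then return 1; fi
    done
    return 1   # the versions are equal
}

fw=1.13.0   # hypothetical value, for illustration only
if version_lt "$fw" 1.14.2; then
    echo "Hypervisor $fw is below 1.14.2: firmware update recommended."
else
    echo "Hypervisor $fw is 1.14.2 or higher."
fi
```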

At this point we found that both solutions involve a power-cycle, which means shutting down all running LDOMs and a full reboot of the machine.
We decided to upgrade the firmware and do the power-cycle, but then realized that the last LDOM configuration saved to the SP was old: we would lose six months of changes to the LDOM configurations (creation of new LDOMs, disk assignments, allocation of network cards, etc.).

The solution we applied was as follows:

Before rebooting the PDOM, we backed up the file ldom-db.xml located in /var/opt/SUNWldm (this file does the magic). It holds all the settings that are active in the PDOM, regardless of whether or not they were saved to the SP.
We copied this file (ldom-db.xml) to /usr/scripts so we could use it later without having to restore it from backup.
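
Because everything that follows depends on this copy, it is worth verifying it before the power-cycle. The following is a minimal sketch of a copy-and-verify helper; the function name `backup_ldom_db` is hypothetical, and the default paths are the ones used in this post.

```shell
#!/bin/sh
# Sketch: copy the live LDOM database to a safe directory and verify it.

# Usage: backup_ldom_db [source-file] [destination-dir]
# Returns 0 only if the copy exists and is byte-for-byte identical.
backup_ldom_db() {
    src=${1:-/var/opt/SUNWldm/ldom-db.xml}
    dest_dir=${2:-/usr/scripts}
    if [ ! -f "$src" ]; then
        echo "missing source file: $src" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    # -p preserves the modification time and permissions of the original.
    cp -p "$src" "$dest_dir/ldom-db.xml" || return 1
    cmp -s "$src" "$dest_dir/ldom-db.xml"
}

# Example call with the defaults from this post (run as root on the PDOM):
#   backup_ldom_db && echo "ldom-db.xml saved in /usr/scripts"
```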

Here are the steps we used.
From the ILOM, we perform the power-cycle:
stop Servers/PDomains/PDomain_2/HOST
and then
start Servers/PDomains/PDomain_2/HOST

Once the PDOM has booted, we stop and unbind the LDOMs, take a backup of the current ldom-db.xml file, and disable the ldmd service.

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.2%  0.2%  8d 2h 38m
dnet1002         active     -n----  5002    8     8G       0.5%  0.5%  5d 2h 49m
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  8d 1h 34m
dsunt200         active     -n----  5001    48    40G      0.0%  0.0%  2m

root@ # ldm stop dsunt200
LDom dsunt200 stopped
root@ # ldm unbind dsunt200

root@ # ldm stop dsunt100
LDom dsunt100 stopped
root@ # ldm unbind dsunt100

root@ # ldm stop dnet1002
LDom dnet1002 stopped
root@ # ldm unbind dnet1002

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.5%  0.5%  8d 2h 40m
dnet1002         inactive   ------          8     8G
dsunt100         inactive   ------          48    40G
dsunt200         inactive   ------          48    40G
root@ #
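
With more than a handful of guests, the stop/unbind sequence above can be driven by a loop. The sketch below picks the guest domain names out of `ldm ls` output in the column layout shown in this post; the function names `guest_domains` and `stop_and_unbind_all` are hypothetical. The sample text is copied from the listing above so the parser can be tried without an LDoms host.

```shell
#!/bin/sh
# Sketch: list every domain except the control domain from `ldm ls` output.

# Reads `ldm ls` output on stdin; prints one guest domain name per line.
# Skips the NAME header line (if present), blank lines and "primary".
guest_domains() {
    awk 'NF > 0 && $1 != "NAME" && $1 != "primary" { print $1 }'
}

# Stops and unbinds every guest; defined here but not called by this sketch.
stop_and_unbind_all() {
    ldm ls | guest_domains | while read -r dom; do
        ldm stop "$dom" && ldm unbind "$dom"
    done
}

# Sample taken from the listing in this post.
sample='primary          active     -n-cv-  UART    8     16G      0.2%  0.2%  8d 2h 38m
dnet1002         active     -n----  5002    8     8G       0.5%  0.5%  5d 2h 49m
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  8d 1h 34m
dsunt200         active     -n----  5001    48    40G      0.0%  0.0%  2m'

printf '%s\n' "$sample" | guest_domains
```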

cd /var/opt/SUNWldm
cp -p ldom-db.xml ldom-db.xml.orig
svcadm disable ldmd

##### Here we use the file stored previously in /usr/scripts/ and overwrite the one in /var/opt/SUNWldm
cp -p /usr/scripts/ldom-db.xml /var/opt/SUNWldm/ldom-db.xml        

Enable the ldmd service.
svcadm enable ldmd
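
After enabling the service it can take a moment to come online, and if the restored file is rejected the service may fall back into maintenance. The following is a minimal polling sketch, assuming `svcs -H -o state` prints just the state word for the service. The names `wait_online`, `STATE_CMD` and `SLEEP_SECS` are hypothetical, and the command is kept in a variable so the loop can be tried on a machine without SMF.

```shell
#!/bin/sh
# Sketch: poll the ldmd service state until it is online or gives up.

# Command that prints the current service state as a single word.
STATE_CMD=${STATE_CMD:-"svcs -H -o state svc:/ldoms/ldmd:default"}

# Usage: wait_online [tries]
# Returns 0 when online, 2 when in maintenance, 1 when the tries run out.
wait_online() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        state=$($STATE_CMD 2>/dev/null)
        case $state in
            online)      return 0 ;;
            maintenance) echo "ldmd is in maintenance" >&2; return 2 ;;
        esac
        tries=$((tries - 1))
        sleep "${SLEEP_SECS:-2}"
    done
    return 1
}

# Example (on the PDOM):
#   svcadm enable ldmd && wait_online && ldm ls
```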

### We check the configuration to confirm that everything is OK.
After checking it, we ran init 6 so the system rebooted normally, and after that we bound and started all the LDOMs, as shown next:

root@ # ldm bind dsunt200
root@ # ldm start dsunt200
LDom dsunt200 started
root@ # ldm bind dsunt100
root@ # ldm start dsunt100
LDom dsunt100 started
root@ # ldm bind dnet1002
root@ # ldm start dnet1002
LDom dnet1002 started

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      3.7%  3.7%  8d 2h 55m
dnet1002         active     -n----  5002    8     8G       0.7%  0.7%  3s
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  2s
dsunt200         active     -n----  5001    48    40G      9.1%  1.0%  2s
root@ #

PS: Please forgive my English ;-)

Wednesday, September 14, 2016

Problem with LDMD and the solution applied

(English version)
In this document, my colleague Nicolas Morono and I describe a problem with the ldmd daemon and how to recover the LDOM configuration from the ldom-db.xml file.

When we tried to assign a LUN to a domain, we got the following error:

# ldm list
Failed to connect to logical domain manager: Connection refused

We checked and the ldmd service was down:
# svcs -xv
svc:/ldoms/ldmd:default (Logical Domains Manager)
State: maintenance since June 2, 2016 06:36:16 PM ART
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
See: /var/svc/log/ldoms-ldmd:default.log
Impact: This service is not running.

These errors were logged in /var/adm/messages:

Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 652011 daemon.warning] svc:/ldoms/ldmd:default: Method "/opt/SUNWldm/bin/ldmd_start" failed with exit status 95.
Jun 2 18:36:16 m5-1-pdom2 svc.startd[33]: [ID 748625 daemon.error] ldoms/ldmd:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Jun 2 18:36:16 m5-1-pdom2 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
Jun 2 18:36:16 m5-1-pdom2 EVENT-TIME: Thu Jun 2 18:36:16 ART 2016
Jun 2 18:36:16 m5-1-pdom2 PLATFORM: SPARC-M5-32, CSN: AK00xx8x1, HOSTNAME: m5-1-pdom2
Jun 2 18:36:16 m5-1-pdom2 SOURCE: software-diagnosis, REV: 0.1
Jun 2 18:36:16 m5-1-pdom2 EVENT-ID: 889f64a0-0102-efd6-997f-8e83e7fba09a
Jun 2 18:36:16 m5-1-pdom2 DESC: A service failed - a start, stop or refresh method failed.
Jun 2 18:36:16 m5-1-pdom2 AUTO-RESPONSE: The service has been placed into the maintenance state.
Jun 2 18:36:16 m5-1-pdom2 IMPACT: svc:/ldoms/ldmd:default is unavailable.
Jun 2 18:36:16 m5-1-pdom2 REC-ACTION: Run 'svcs -xv svc:/ldoms/ldmd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at for the latest service procedures and policies regarding this diagnosis.
Jun 2 18:40:28 m5-1-pdom2 cmlb: [ID 107833 1

The SMF service log contains these errors:

# cat /var/svc/log/ldoms-ldmd:default.log
Jun 02 18:35:16 timeout waiting for op HVctl_op_get_bulk_res_stat
Jun 02 18:35:16 fatal error: waiting for hv response timeout

[ Jun 2 18:35:16 Stopping because process dumped core. ]
[ Jun 2 18:35:16 Executing stop method (:kill). ]
[ Jun 2 18:35:16 Executing start method ("/opt/SUNWldm/bin/ldmd_start"). ]
Jun 02 18:36:16 timeout waiting for op HVctl_op_hello
Jun 02 18:36:16 fatal error: waiting for hv response timeout

[ Jun 2 18:36:16 Method "start" exited with status 95. ]

We searched the Oracle documentation and concluded that there was a bug in firmware versions below 1.14.2, which matched our environment.
We opened a service request to confirm our analysis, and the proposed solution was the same.

In short, the bug is in Hypervisor versions lower than 1.14.2.
- The short-term solution is to power-cycle the system.
- The medium/long-term solution is to update the system firmware to a recent version (HypV 1.14.2 or higher).

At this point we found that both solutions involve a power-cycle, which means shutting down all running LDOMs and a full reboot of the machine.
We chose to upgrade the firmware, and when preparing the power-cycle we realized that the last saved LDOM configuration was old and we were going to lose 6 months of changes to the LDOM configurations (creation of new LDOMs, disk assignments, assignment of network cards, etc.).

The solution we applied was the following:

Before booting the machine, we located the file ldom-db.xml in /var/opt/SUNWldm. That file holds all the configuration that is active in the PDOM, regardless of whether or not it is saved in the SP.
We left a copy of the file in /usr/scripts (so that restoring it from backup was not necessary).

The power-cycle is performed from the ILOM:
stop Servers/PDomains/PDomain_2/HOST
and then
start Servers/PDomains/PDomain_2/HOST

Once the machine has booted, we stop and unbind the LDOMs, take a backup of the ldom-db.xml file, and disable the LDOMs daemon.

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.2%  0.2%  8d 2h 38m
dnet1002         active     -n----  5002    8     8G       0.5%  0.5%  5d 2h 49m
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  8d 1h 34m
dsunt200         active     -n----  5001    48    40G      0.0%  0.0%  2m

root@ # ldm stop dsunt200
LDom dsunt200 stopped
root@ # ldm unbind dsunt200

root@ # ldm stop dsunt100
LDom dsunt100 stopped
root@ # ldm unbind dsunt100

root@ # ldm stop dnet1002
LDom dnet1002 stopped
root@ # ldm unbind dnet1002

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      0.5%  0.5%  8d 2h 40m
dnet1002         inactive   ------          8     8G
dsunt100         inactive   ------          48    40G
dsunt200         inactive   ------          48    40G
root@ #

cd /var/opt/SUNWldm
cp -p ldom-db.xml ldom-db.xml.orig
svcadm disable ldmd

##### Earlier we had backed up the file with the data and left it in /usr/scripts/. Now we overwrite the original in /var/opt/SUNWldm
cp -p /usr/scripts/ldom-db.xml /var/opt/SUNWldm/ldom-db.xml        

# We enable the daemon again.
svcadm enable ldmd

### We check the configuration to confirm that everything is OK. After checking it, we ran init 6 so the system rebooted normally, and after that we bound and started the domains.

root@ # ldm bind dsunt200
root@ # ldm start dsunt200
LDom dsunt200 started
root@ # ldm bind dsunt100
root@ # ldm start dsunt100
LDom dsunt100 started
root@ # ldm bind dnet1002
root@ # ldm start dnet1002
LDom dnet1002 started

root@ # ldm ls
primary          active     -n-cv-  UART    8     16G      3.7%  3.7%  8d 2h 55m
dnet1002         active     -n----  5002    8     8G       0.7%  0.7%  3s
dsunt100         active     -n----  5000    48    40G      0.0%  0.0%  2s
dsunt200         active     -n----  5001    48    40G      9.1%  1.0%  2s
root@ #