Linux Tutorial: Deploying OpenSIPS (OpenSer) Under Linux-HA - Heartbeat v2.0
Heartbeat (or more formally, Linux-HA) provides application monitoring with the ability to restart or migrate a service (like OpenSIPS or OpenSER) and dependent resources (like IP addresses) to other machines in the event of a failure. Typically a monitoring process returns the status of a resource; the check can be as simple as a ping or as complex as a full-fledged application-level test. In the event of a failure, a tree of services (typically the IP alias and the service that runs on top of it) is restarted or migrated to a new, more desirable node.
The Linux-HA project started as a simple process monitoring and failover application that, among other things, didn't take service hierarchy into account. Version 2 of Linux-HA was a major rewrite of the application that added hierarchically defined services and used the industry-standard OCF definition to describe service monitoring tools and dependency trees.
OCF files for the services are kept in /usr/lib/ocf/resource.d and are grouped by directories named after each provider. The included provider is heartbeat, which supplies (among other things) IPaddr2, which I use for IP address setup, teardown and monitoring. It differs from IPaddr (also in that directory) in that it is iproute2 aware. The other provider I use is anders.com, which contains the OpenSIPS OCF resource agent. This agent controls and monitors OpenSIPS at the application level by using sipsak to send test requests to the application layer.
The service definition hierarchy is maintained in the /var/lib/heartbeat/crm/cib.xml file. This is the main file for configuring Linux-HA. It is VERY finicky.
During normal operation, the cib.xml file will be synchronized between all the nodes which means it will get rewritten. It contains the state information for the services being monitored and hashes for each of the nodes in the group. If you need to make a change to the cib.xml file, start by shutting down all of the nodes in the group. Make sure you keep all IDs unique across the file and be aware of the backup files in the same directory. It doesn't hurt to blow everything except for the cib.xml file away on all machines when heartbeat is stopped to make sure all nodes are in sync. Once you have made the changes you wish to make, increment the admin_epoch number in the cib.xml and copy it to each of the participating nodes. Start the preferred node before any others to minimize service migration.
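The admin_epoch bump described above can be scripted. The following is a minimal sketch, assuming the attribute appears as admin_epoch="N" on the opening <cib> element (the example file later in this article does not show it). The function name bump_admin_epoch is made up for illustration and is not part of heartbeat. Run it only while heartbeat is stopped on every node.

```shell
#!/bin/sh
# Sketch: increment admin_epoch="N" on the <cib ...> element of a cib.xml file.
# bump_admin_epoch is a hypothetical helper name, not a heartbeat tool.
bump_admin_epoch() {
    file=$1
    # Pull the current value out of the opening <cib ...> tag.
    current=$(sed -n 's/.*<cib[^>]*admin_epoch="\([0-9][0-9]*\)".*/\1/p' "$file" | head -n 1)
    if [ -z "$current" ]; then
        echo "no admin_epoch attribute found in $file" >&2
        return 1
    fi
    next=$((current + 1))
    tmp=$(mktemp) || return 1
    if sed "s/\(<cib[^>]*\)admin_epoch=\"$current\"/\1admin_epoch=\"$next\"/" "$file" > "$tmp"; then
        # Write back through the original file so its owner and mode are kept.
        cat "$tmp" > "$file"
        rm -f "$tmp"
        echo "$next"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Calling bump_admin_epoch /var/lib/heartbeat/crm/cib.xml prints the new value; the file still has to be copied to the other nodes by hand.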
The ha.cf file in /etc/ha.d configures some very basic heartbeat options. Most significantly, it dictates whether or not the CRM engine is on. This essentially differentiates between the old heartbeat version 1 and the new heartbeat version 2 with CRM support. (I use v2 with CRM.) The ha.cf file also lists all the nodes that will be participating in the cluster and how inter-cluster communication will work. In this case we will be sending broadcasts from eth0.10, or VLAN 10 on eth0.
Note: It is very important to name the nodes exactly what the uname -n command reports. You can't just pick whatever name sounds good to you unless you rename the machine itself.
/etc/ha.d/ha.cf
udpport 469
bcast eth0.10
node sip-a sip-b
crm on
When running multiple heartbeat setups on the same broadcast segment you must use a separate port for each setup.
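The node-name rule can be checked before starting heartbeat. This is a small sketch, assuming node names are listed on "node" lines as in the ha.cf shown above; node_listed is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if a node name appears on a "node" line in ha.cf.
# node_listed is a hypothetical helper; the name defaults to `uname -n`.
node_listed() {
    hacf=$1
    name=${2:-$(uname -n)}
    # Print every name found on "node" lines, one per line, then look for an exact match.
    awk '$1 == "node" { for (i = 2; i <= NF; i++) print $i }' "$hacf" |
        grep -qxF -- "$name"
}
```

For example, node_listed /etc/ha.d/ha.cf || echo "this host is not listed in ha.cf" warns when the local name is missing.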
Authkeys
As the name implies, the authkeys file lists the private strings the nodes will use as keys to authenticate communication between the nodes. As this is private data, the file should only be readable by root.
chown root:root /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys
The file lists the authentication method (md5) and the shared string to be used as its key.
/etc/ha.d/authkeys
auth 1
1 md5 a0ff2cc2bbdff6c7a55090ea4f55400f
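One way to produce such a file with a fresh random key is sketched below. It assumes /dev/urandom and od are available; write_authkeys is a hypothetical helper name, and the key it generates is simply 16 random bytes printed as hex.

```shell
#!/bin/sh
# Sketch: write an authkeys file with a random 32-hex-digit md5 key.
# write_authkeys is a hypothetical helper, not shipped with heartbeat.
write_authkeys() {
    path=$1
    # 16 random bytes rendered as 32 lowercase hex characters.
    key=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    # Create the file with a restrictive umask so it is never world-readable.
    ( umask 077; printf 'auth 1\n1 md5 %s\n' "$key" > "$path" ) || return 1
    chmod 600 "$path"
}
```

Generate the file once, for example with write_authkeys /etc/ha.d/authkeys, and copy the same file to every node, since all nodes must share the key.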
The cib.xml File
This is an example cib.xml file that assumes two IP addresses and runs OpenSIPS on them. In the event of a migration, the new node will bring up the IP addresses (sending a gratuitous ARP) and then start OpenSIPS. The order of services (what depends on what) is described in this example.
/var/lib/heartbeat/crm/cib.xml
<cib>
<configuration>
<crm_config>
<cluster_property_set id="cluster-property-set">
<attributes>
<nvpair id="short_resource_names" name="short_resource_names" value="true"/>
<nvpair id="pe-input-series-max" name="pe-input-series-max" value="-1"/>
<nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="10"/>
<nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-10"/>
<nvpair id="start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
</attributes>
</cluster_property_set>
<cluster_property_set id="cib-bootstrap-options">
<attributes>
<nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1194982799"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes />
<resources>
<group id="IPaddr2_OpenSIPS_group">
<primitive id="IPaddr2-1.2.3.4" class="ocf" type="IPaddr2" provider="heartbeat">
<operations>
<op id="ipaddr2-1.2.3.4-monitor" name="monitor" interval="5s" timeout="3s"/>
</operations>
<instance_attributes id="IPaddr2-1.2.3.4-attributes">
<attributes>
<nvpair id="ipaddr2-1.2.3.4-ip" name="ip" value="1.2.3.4"/>
<nvpair id="ipaddr2-1.2.3.4-broadcast" name="broadcast" value="1.2.3.255"/>
<nvpair id="ipaddr2-1.2.3.4-cidr_netmask" name="cidr_netmask" value="24"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="IPaddr2-1.2.3.5" class="ocf" type="IPaddr2" provider="heartbeat">
<operations>
<op id="ipaddr2-1.2.3.5-monitor" name="monitor" interval="5s" timeout="3s"/>
</operations>
<instance_attributes id="IPaddr2-1.2.3.5-attributes">
<attributes>
<nvpair id="ipaddr2-1.2.3.5-ip" name="ip" value="1.2.3.5"/>
<nvpair id="ipaddr2-1.2.3.5-broadcast" name="broadcast" value="1.2.3.255"/>
<nvpair id="ipaddr2-1.2.3.5-cidr_netmask" name="cidr_netmask" value="24"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="OpenSIPS" class="ocf" type="OpenSIPS" provider="anders.com">
<operations>
<op id="opensips-start" name="start" timeout="5s"/>
<op id="opensips-stop" name="stop" timeout="3s"/>
<op id="opensips-monitor" name="monitor" interval="10s" timeout="3s">
<instance_attributes id="monitor_10s">
<attributes>
<nvpair id="opensips-monitor-ip" name="ip" value="127.0.0.1"/>
</attributes>
</instance_attributes>
</op>
</operations>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="OpenSIPS_resource_location" rsc="OpenSIPS">
<rule id="rule_sip-a" score="100">
<expression id="expression_uname_eq_sip-a" attribute="#uname" operation="eq" value="sip-a"/>
</rule>
<rule id="rule_sip-b" score="10">
<expression id="expression_uname_eq_sip-b" attribute="#uname" operation="eq" value="sip-b"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
We verify cib files with crm_verify:
crm_verify -x /var/lib/heartbeat/crm/cib.xml
Make sure you set the ownership to cluster:cluster on that file and remove backup versions on the off chance they might conflict with the new cib.xml file.
rm /var/lib/heartbeat/crm/cib.xml.*
chown cluster:cluster -R /var/lib/heartbeat/crm/
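Because every ID has to be unique across the file, a quick text-level check for duplicates can be useful alongside crm_verify. This is only a sketch: it assumes each id attribute is written as id="..." preceded by a space, as in the example above, and duplicate_ids is a hypothetical helper name. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Sketch: print every id="..." value that occurs more than once in a cib.xml file.
# duplicate_ids is a hypothetical helper; empty output means no duplicates were found.
duplicate_ids() {
    grep -o ' id="[^"]*"' "$1" |
        sed 's/^ id="//; s/"$//' |
        sort | uniq -d
}
```

Running duplicate_ids /var/lib/heartbeat/crm/cib.xml should print nothing for a healthy file.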
OCF Files
I wrote my own OCF file for monitoring OpenSIPS, which uses sipsak to do application-level testing over 127.0.0.1. (Make sure OpenSIPS listens on 127.0.0.1 as well.)
/usr/lib/ocf/resource.d/anders.com/OpenSIPS
#!/bin/sh
# Initialization:
. /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs
usage() {
cat <<-!
usage: $0 {start|stop|status|monitor|meta-data|validate-all}
!
}
meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="OpenSIPS">
<version>1.0</version>
<longdesc lang="en">
Resource Agent for the OpenSIPS SIP Proxy.
</longdesc>
<shortdesc lang="en">OpenSIPS resource agent</shortdesc>
<parameters>
<parameter name="ip" unique="0" required="1">
<longdesc lang="en">
IP Address of the OpenSIPS Instance. This is only used for monitoring.
</longdesc>
<shortdesc lang="en">IP Address</shortdesc>
<content type="string" default="" />
</parameter>
<parameter name="port" unique="0" required="0">
<longdesc lang="en">
Port of the OpenSIPS Instance. This is only used for monitoring.
</longdesc>
<shortdesc lang="en">Port</shortdesc>
<content type="string" default="5060" />
</parameter>
</parameters>
<actions>
<action name="start" timeout="30" />
<action name="stop" timeout="30" />
<action name="status" depth="0" timeout="30" interval="10" start-delay="30" />
<action name="monitor" depth="0" timeout="30" interval="10" start-delay="30" />
<action name="meta-data" timeout="5" />
<action name="validate-all" timeout="5" />
<action name="notify" timeout="5" />
<action name="promote" timeout="5" />
<action name="demote" timeout="5" />
</actions>
</resource-agent>
END
}
OpenSIPS_Status() {
#echo "/usr/bin/sipsak -s sip:test@$OCF_RESKEY_ip -H 127.0.0.1 2>/dev/null >/dev/null" > /tmp/a
/usr/bin/sipsak -s sip:test@$OCF_RESKEY_ip -H 127.0.0.1 2>/dev/null >/dev/null
rc=$?
if
[ $rc -ne 0 ]
then
return $OCF_NOT_RUNNING
else
return $OCF_SUCCESS
fi
}
OpenSIPS_Monitor( ) {
OpenSIPS_Status
}
OpenSIPS_Start( ) {
if
OpenSIPS_Status
then
ocf_log info "OpenSIPS already running."
return $OCF_SUCCESS
else
/etc/init.d/opensips start >/dev/null
rc=$?
if
[ $rc -ne 0 ]
then
return $OCF_ERR_PERM
else
return $OCF_SUCCESS
fi
fi
}
OpenSIPS_Stop( ) {
/etc/init.d/opensips stop >/dev/null
return $OCF_SUCCESS
}
OpenSIPS_Validate_All( ) {
return $OCF_SUCCESS
}
if [ $# -ne 1 ]; then
usage
exit $OCF_ERR_ARGS
fi
case $1 in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) OpenSIPS_Start
;;
stop) OpenSIPS_Stop
;;
monitor) OpenSIPS_Monitor
;;
status) OpenSIPS_Status
;;
validate-all) OpenSIPS_Validate_All
;;
notify) exit $OCF_SUCCESS
;;
promote) exit $OCF_SUCCESS
;;
demote) exit $OCF_SUCCESS
;;
usage) usage
exit $OCF_SUCCESS
;;
*) usage
exit $OCF_ERR_ARGS
;;
esac
exit $?
We use the OCF tester to check the validity of this OCF file. (Make sure you set the IP to the service address on your system. Be aware that this will start the service so it can test application monitoring and shutdown so don't run it on production IPs unless you know what you are doing.)
/usr/lib/ocf/resource.d/ocf-tester -o ip=127.0.0.1 /usr/lib/ocf/resource.d/anders.com/OpenSIPS
Hacks to the Standard Gentoo Heartbeat Build
I don't emerge heartbeat but rather build it from source. (heartbeat-2.1.2 as of this writing) However, older installs may have left an incompatible version of heartbeat installed whose elements can conflict. Typically this will show up in the logs as a crash of pengine or some other process heartbeat spawns. To avoid these errors, rm -fr /usr/lib/heartbeat and re-install.
To configure from source, build and install:
./ConfigureMe configure
make
make install
This will configure and build a setup with config files in a Gentoo-ish layout. You will find most important configuration in:
/etc/ha.d
/usr/lib/ocf/resource.d
/usr/local/var/run/heartbeat/crm
/etc/init.d/opensips uses killproc, but that could be a little too random if you run multiple instances of opensips on the same machine. However, the script does write a PID file when it starts opensips, so we change the killproc line to:
kill `cat $PIDFILE` &>/dev/null
Using killproc or killall might kill other instances of opensips on the same machine so killing the master PID is a much better solution.
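A slightly more defensive version of that stop step is sketched below. It assumes the PID file contains only the master PID; stop_by_pidfile is a hypothetical helper name, not something the init script provides.

```shell
#!/bin/sh
# Sketch: stop one instance by sending SIGTERM to the PID stored in its PID file.
# stop_by_pidfile is a hypothetical helper name.
stop_by_pidfile() {
    pidfile=$1
    # No PID file means there is nothing to stop.
    [ -r "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    # Refuse to signal anything that is not a plain number.
    case $pid in
        ''|*[!0-9]*)
            echo "unexpected content in $pidfile" >&2
            return 1
            ;;
    esac
    # Ignore the error if the process has already exited.
    kill "$pid" 2>/dev/null
    return 0
}
```

Unlike killproc or killall, this only ever signals the one process named in the given PID file, so other opensips instances on the machine are left alone.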
If monit is on the box, it is usually started and stopped from within the /etc/init.d/opensips file. The conventional /etc/init.d/opensips start command starts monit in this case, and monit in turn executes /etc/init.d/opensips opensipsstart to get opensips running. When I use heartbeat, there is no reason to use monit, but I still have to start opensips with /etc/init.d/opensips opensipsstart. (You might need to change this in the OCF file above.)
Heartbeat with OpenSIPS Checklist
When activating a heartbeat controlled OpenSIPS setup make sure to:
* Blow away any old heartbeat installs
rm -fr /var/lib/heartbeat
rm -fr /usr/lib/heartbeat
* Compile from source and install the latest-greatest tested release. (heartbeat-2.1.2 as of this writing)
./ConfigureMe configure
make
make install
* Edit /var/lib/heartbeat/crm/cib.xml to taste.
* Kill all old cib.xml.* files:
rm /var/lib/heartbeat/crm/cib.xml.*
* Set the file ownership on the crm directory and files:
chown cluster:cluster -R /var/lib/heartbeat/crm/
* Edit /etc/init.d/opensips
o Make sure the correct version of opensips gets started.
o The OCF file will want to run a /etc/init.d/opensips start so make sure start will work or change the OCF to run the command opensipsstart instead if monit changed /etc/init.d/opensips.
o Make sure killproc isn't used. Instead, kill the pid from the pidfile as mentioned above.
* Edit /etc/ha.d/ha.cf to make it look something like this:
udpport 469
bcast eth0.10
node sip-a sip-b
crm on
* Make sure you have sipsak on your box in /usr/bin/sipsak.
which sipsak
* Make sure the same IPs are specified in the opensips.cfg and the cib.xml files.
* Make sure the IP used for monitoring is 127.0.0.1 in the cib.xml.
* Make sure that opensips is listening on 127.0.0.1 as well as its production IPs.
* In the case where a nameserver isn't reachable, OpenSIPS will hang on trying to reverse resolve the production IPs so add entries for them in /etc/hosts that reflect the real names so there are as few external dependencies as possible.
* Make sure OpenSIPS is configured to respond to OPTIONS messages on 127.0.0.1 so the OCF Tester can test the application-level health of OpenSIPS.
* Test sipsak to make sure it succeeds / fails when the service is on / off.
/usr/bin/sipsak -s sip:test@127.0.0.1 -H 127.0.0.1
* Make sure OpenSIPS has libpq.so.5 for the PostgreSQL module if you are using PostgreSQL. If not, install PostgreSQL 8.2.5 or later. (as of this writing)
* Add the production IPs to the box (or comment them out of opensips.cfg) and use the OCF tester to make sure it can start / monitor / stop opensips.
ip address add 1.2.3.4/24 dev eth0.10
ip address add 1.2.3.5/24 dev eth0.10
/usr/lib/ocf/resource.d/ocf-tester -o ip=127.0.0.1 /usr/lib/ocf/resource.d/anders.com/OpenSIPS
Make sure it says "/usr/lib/ocf/resource.d/anders.com/OpenSIPS passed all tests"
If heartbeat yammers like this:
Nov 28 20:52:28 sip-a heartbeat: [22070]: WARN: nodename sip-a uuid changed to sip-b
Nov 28 20:52:28 sip-a heartbeat: [22070]: debug: displaying uuid table
Nov 28 20:52:28 sip-a heartbeat: [22070]: debug: uuid=9052abe5-87ee-4400-a008-c5f13205e94b, name=sip-a
Nov 28 20:52:28 sip-a heartbeat: [22070]: ERROR: should_drop_message: attempted replay attack [sip-b]? [gen = 10, curgen = 21]
then kill this file:
rm /var/lib/heartbeat/hb_uuid
Controlling Heartbeat
To get an overview of what's going on, run:
crm_mon
To list the resources under control:
crm_resource -L
To push a resource off of this box:
crm_resource -M -r OpenSIPS
This creates a constraint scored at INFINITY saying that a resource should not run on this host.
To remove an INFINITY constraint placed by the above command:
crm_resource -U -r OpenSIPS
When a resource is moved off of a node because it can't be started (for example when the opensips.cfg file is broken) the node is marked as bad and the resource is migrated to another node. After fixing the resource, you will need to clear the resource before it will migrate back to the primary. That is done like this:
crm_resource -C -r OpenSIPS
However, when a resource fails for whatever reason, its failure count is incremented. To actually "fail back" to the primary node you must also make sure the failure count is below the threshold for that resource. (A good practice is to set it back to 0.)
To see the failure count:
crm_failcount -G -U sip-a -r OpenSIPS
To reset the failure count:
crm_failcount -v 0 -U sip-a -r OpenSIPS
Configuring when to move a service from node to node is done through scores assigned to individual nodes and the stickiness / failure-stickiness of resources.
The calculation is:
(sip-a score - sip-b score + stickiness) / abs(failure stickiness)
In our case, the settings are:
sip-a = 100
sip-b = 10
default stickiness = 10
stickiness = 30 (10 for each resource: ip, ip, OpenSIPS)
failure stickiness = -10
So:
(sip-a - sip-b + stickiness) / abs(failure stickiness)
= (100 - 10 + (10 + 10 + 10)) / 10
= 120 / 10
= 12
Therefore, in this case OpenSIPS can fail 12 times on sip-a before being moved to sip-b.
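The same arithmetic can be reproduced in the shell. This is a sketch of the formula exactly as stated above, not of heartbeat's internal scoring; failures_before_move is a hypothetical helper name. Shell arithmetic is integer-only, so any remainder is dropped. With the values from this setup, 100 - 10 + 30 = 120, and 120 / 10 gives 12.

```shell
#!/bin/sh
# Sketch: (primary score - backup score + total stickiness) / abs(failure stickiness).
# failures_before_move is a hypothetical helper name.
failures_before_move() {
    primary=$1
    backup=$2
    stickiness=$3
    failure_stickiness=$4
    # Take the absolute value of the failure stickiness.
    if [ "$failure_stickiness" -lt 0 ]; then
        failure_stickiness=$((0 - failure_stickiness))
    fi
    if [ "$failure_stickiness" -eq 0 ]; then
        echo "failure stickiness must not be 0" >&2
        return 1
    fi
    echo $(( (primary - backup + stickiness) / failure_stickiness ))
}

# sip-a = 100, sip-b = 10, stickiness = 3 x 10, failure stickiness = -10
failures_before_move 100 10 30 -10   # prints 12
```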
Of course, if a service fails to start, it is immediately moved and the node is marked bad. This is desirable for a service that we don't want to see down, because the service will in effect revert to the last known-good configuration running on the backup node. This allows us to fix our primary node while the service runs on the backup.
Manually Failing Back
If OpenSIPS fails to start on a node (for example, when you have a broken config file), the node is marked as bad and a restart won't be attempted.
To force a resource to fail back to the primary, you should reset the failure counts to 0 on the primary and backup:
crm_failcount -v 0 -U sip-a -r OpenSIPS
crm_failcount -v 0 -U sip-b -r OpenSIPS
and clear the OpenSIPS resource so it forgets where it wasn't able to start.
crm_resource -C -r OpenSIPS
This should work in all cases. If the resource still migrates to the backup node, there is a good chance OpenSIPS is still broken on the primary node.
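The fail-back steps above can be wrapped in one function. The sketch below only chains the crm_failcount and crm_resource commands already shown; fail_back, run and the DRY_RUN variable are hypothetical names introduced for illustration. With DRY_RUN set, the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch: reset failure counts on the given nodes, then clear the resource.
# fail_back, run and DRY_RUN are hypothetical names, not heartbeat features.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"      # dry run: only show what would be executed
    else
        "$@"
    fi
}

fail_back() {
    resource=$1
    shift
    # Reset the failure count on every node that was passed in.
    for node in "$@"; do
        run crm_failcount -v 0 -U "$node" -r "$resource" || return 1
    done
    # Make the cluster forget where the resource failed to start.
    run crm_resource -C -r "$resource"
}
```

For this setup the call would be fail_back OpenSIPS sip-a sip-b; setting DRY_RUN=1 first prints the three commands without touching the cluster.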
lrmd CPU usage
A patch for lrmd that reduces CPU usage is here: http://hg.linux-ha.org/dev/rev/0ded50597e97
Comments (1)
Anders from RTP
I've been asked for the OpenSER start / stop script I use. This comes with Gentoo. (I think)
/etc/init.d/openser