"This host had been out of contact with Cloudera Manager for too long. The host's Cloudera Manager agent's software version could not be determined."
Today I saw this error pop up on the CM4 hosts monitor. Running /etc/init.d/cloudera-scm-agent status only confirmed that the agent was running, so I had to review the logs to find the actual error.
The log for the agent is located at /var/log/cloudera-scm-agent/cloudera-scm-agent.log
The error reported looked like this:
[08/Apr/2014 15:58:09 +0000] 1228 MainThread agent ERROR Heartbeating to prodsrv01vmid.saic.com:7182 failed.
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 741, in send_heartbeat
File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 471, in __init__
File "/usr/lib64/python2.6/httplib.py", line 720, in connect
File "/usr/lib64/python2.6/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -5] No address associated with hostname
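The failing call in the traceback is getaddrinfo, so the same lookup can be tried by hand from the agent host. Below is a minimal sketch, assuming a Python 3 interpreter is available on the machine (the agent itself ran on Python 2.6). The helper name check_resolves is mine and is not part of the agent.

```python
import socket


def check_resolves(host, port):
    """Return (True, addresses) if host resolves, else (False, error text).

    Hypothetical helper: it repeats the getaddrinfo call from the traceback
    so the lookup can be tested outside the agent.
    """
    try:
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Same exception class the agent log reports.
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved address.
    addresses = sorted({info[4][0] for info in results})
    return True, addresses


if __name__ == "__main__":
    # Replace "localhost" with the host name from the "Heartbeating to" log line.
    ok, detail = check_resolves("localhost", 7182)
    print(ok, detail)
```

If this returns False for the host name in the log, the agent is being pointed at a name the host cannot resolve.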
The problem was that, when the system rebooted, the server_host entry in /etc/cloudera-scm-agent/config.ini had been modified:
# Hostname of Cloudera SCM Server
The DNS server had an old host name entry for the IP address my Cloudera SCM Server was now using. When the system restarted the agent, I believe a DNS lookup was performed on that IP and resolved to the old host name. My cluster uses /etc/hosts files for name resolution, so I'm not 100% sure yet why this happened, but I speculate it is a result of the socket library in Python, which the Cloudera SCM agent uses.
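One way to test the stale-DNS theory is to ask the resolver what name it returns for the SCM server's IP address. The sketch below is only an illustration, and reverse_lookup is a hypothetical helper name. Whether the answer comes from /etc/hosts or from DNS normally depends on the hosts line in /etc/nsswitch.conf, so the result reflects whatever order the host is configured with.

```python
import socket


def reverse_lookup(ip_address):
    """Return the primary host name the resolver reports for an IP, or None.

    Hypothetical helper for comparing the resolver's answer with the
    entry expected from /etc/hosts.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    except (socket.herror, socket.gaierror):
        # No reverse mapping was found for this address.
        return None
    return hostname


if __name__ == "__main__":
    # Replace 127.0.0.1 with the IP address of the SCM server.
    print(reverse_lookup("127.0.0.1"))
```

If the name printed for the server's IP is the old host name, the resolver is picking up the stale record.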
I resolved it by changing the server_host value back to the host name of the machine running the SCM server, then restarting the cloudera-scm-agent service.
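Before restarting the agent, it can help to confirm that the value in config.ini actually resolves. This is a rough sketch, assuming server_host sits in a [General] section of the file (check your own copy) and that Python 3 is available. Both function names are made up for the example, and port 7182 is taken from the log line above.

```python
import configparser
import socket


def read_server_host(config_path):
    """Return server_host from the agent config, or None if it is absent.

    Assumption: the option lives in a [General] section.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # read() silently skips files it cannot open
    return parser.get("General", "server_host", fallback=None)


def server_host_resolves(config_path, port=7182):
    """Return True if the configured server_host resolves on this machine."""
    host = read_server_host(config_path)
    if not host:
        return False
    try:
        socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    path = "/etc/cloudera-scm-agent/config.ini"
    print(read_server_host(path), server_host_resolves(path))
```

If the second value printed is False, the agent will hit the same gaierror on its next heartbeat, so fix server_host (or name resolution) before restarting the service.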