"This host had been out of contact with Cloudera Manager for too long. The host's Cloudera Manager agent's software version could not be determined."
Today I saw this error pop up on the CM4 hosts monitor. Running /etc/init.d/cloudera-scm-agent status only confirmed that the agent was running, so I had to review the logs to find the actual error.
The log for the agent is located at /var/log/cloudera-scm-agent/cloudera-scm-agent.log
The error reported looked like this:
[08/Apr/2014 15:58:09 +0000] 1228 MainThread agent ERROR Heartbeating to prodsrv01vmid.saic.com:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 741, in send_heartbeat
    self.master_port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 471, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -5] No address associated with hostname
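The last frame of the traceback shows the failure comes from getaddrinfo, so the same lookup can be tried by hand outside the agent. Below is a minimal sketch in modern Python, not the agent's own code; the function name check_resolution is made up for illustration, and the default port 7182 is taken from the log line above.

```python
import socket


def check_resolution(host, port=7182):
    """Try the same kind of lookup the traceback shows failing.

    Returns (True, [addresses]) if the name resolves,
    or (False, error message) if getaddrinfo raises gaierror.
    check_resolution is a hypothetical helper, not part of Cloudera Manager.
    """
    try:
        # Same argument shape as the call in socket.create_connection.
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved address.
    addresses = sorted({res[4][0] for res in results})
    return True, addresses
```

Calling check_resolution("prodsrv01vmid.saic.com") on the affected host would be expected to return False along with the resolver's error message, while a name that resolves returns True and its addresses.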
The problem was that when the system rebooted, the server_host value in /etc/cloudera-scm-agent/config.ini was changed to an old host name:
[General]
# Hostname of Cloudera SCM Server
server_host=prodsrv01vmid.saic.com
The DNS server still had an old host name entry for the IP address my Cloudera SCM Server was now using. When the system restarted the agent, I believe a reverse DNS lookup on that IP returned the old host name. My cluster uses /etc/hosts files for name resolution, so I'm not 100% sure yet why this happened, but I suspect it comes from the socket library in Python, which the Cloudera SCM agent uses.
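One way to test the stale-DNS theory is to ask the resolver which name it reports for the SCM server's IP address. This is only a sketch of that check, assuming modern Python; reverse_lookup is a made-up helper name, and which sources the lookup consults (/etc/hosts, DNS, or both) depends on the system's resolver configuration.

```python
import socket


def reverse_lookup(ip_address):
    """Return the primary host name the resolver reports for ip_address.

    Returns None if the address has no reverse mapping.
    reverse_lookup is a hypothetical helper for illustration only.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname
```

If the name returned for the server's IP is the old one rather than the name in /etc/hosts, that would support the idea that a reverse lookup went to DNS.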
I resolved it by changing the server_host value back to the name of the host running the SCM server, then restarting the cloudera-scm-agent service.
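To confirm which server the agent is configured to contact, before or after editing the file, the server_host value can be read straight from config.ini. A minimal sketch, assuming the file parses as a standard INI file with a [General] section as in the fragment above; read_server_host is a made-up helper name.

```python
import configparser


def read_server_host(path="/etc/cloudera-scm-agent/config.ini"):
    """Return the server_host value from the [General] section.

    read_server_host is a hypothetical helper, not part of Cloudera Manager.
    """
    # Interpolation is disabled so a literal '%' in any value is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    # read() returns the list of files it managed to parse.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return parser.get("General", "server_host")
```

If this prints anything other than the current SCM server host name, the agent will keep trying to heartbeat to the wrong host after the next restart.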