Tuesday, April 8, 2014

Cloudera SCM Agent Error

"This host had been out of contact with Cloudera Manager for too long. The host's Cloudera Manager agent's software version could not be determined."

Today this error popped up on the CM4 hosts monitor. Running /etc/init.d/cloudera-scm-agent status confirmed only that the agent was running, so I had to review the logs to find the actual error.

The log for the agent is located at /var/log/cloudera-scm-agent/cloudera-scm-agent.log

The error reported looked like this:

[08/Apr/2014 15:58:09 +0000] 1228 MainThread agent        ERROR    Heartbeating to prodsrv01vmid.saic.com:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 741, in send_heartbeat
    self.master_port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/avro-1.6.3-py2.6.egg/avro/ipc.py", line 471, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -5] No address associated with hostname
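
The traceback ends in getaddrinfo, so the failure is name resolution rather than the connection itself. A minimal sketch for reproducing that check outside the agent follows. It assumes Python 3 (the agent itself runs on Python 2.6), and check_resolution and its resolver parameter are hypothetical names introduced here for illustration, not part of the Cloudera agent.

```python
import socket


def check_resolution(host, port, resolver=socket.getaddrinfo):
    """Return (True, addresses) if host resolves, else (False, error text).

    `resolver` is a hypothetical hook so the check can be exercised
    without touching real DNS; it defaults to socket.getaddrinfo, the
    same call that fails in the traceback.
    """
    try:
        results = resolver(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Covers errors such as "[Errno -5] No address associated with hostname".
        return False, "%s: %s" % (host, exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the resolved IP address.
    addresses = sorted(set(info[4][0] for info in results))
    return True, addresses


if __name__ == "__main__":
    # Host name and port taken from the log line above.
    ok, detail = check_resolution("prodsrv01vmid.saic.com", 7182)
    print(ok, detail)
```

Running this on the affected host with the name from the log should show whether the resolver can find an address for it at all.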

The problem was that when the system rebooted, the file /etc/cloudera-scm-agent/config.ini was modified. Its server_host entry now matched the host name in the failing heartbeat:

[General]
# Hostname of Cloudera SCM Server
server_host=prodsrv01vmid.saic.com
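
To see which server the agent is configured to contact without opening the file by hand, something like the following sketch may help. It assumes Python 3 and that config.ini is plain INI with a [General] section as shown above; server_host_from_text and read_server_host are hypothetical helper names, not part of the agent.

```python
import configparser


def server_host_from_text(text):
    """Return the server_host value from INI text, or None if absent."""
    # interpolation=None so a literal '%' elsewhere in the file is not
    # treated as a configparser interpolation marker.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.get("General", "server_host", fallback=None)


def read_server_host(path="/etc/cloudera-scm-agent/config.ini"):
    """Read server_host from the agent config file at `path`."""
    with open(path) as handle:
        return server_host_from_text(handle.read())


if __name__ == "__main__":
    print(read_server_host())
```

Comparing the printed value against the host that actually runs the SCM server is a quick way to spot a changed entry after a reboot.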

The DNS server had an old host name entry for the IP address my Cloudera SCM Server was now using. I believe that when the system restarted the agent, a reverse DNS lookup on that IP returned the old host name. My cluster relies on /etc/hosts files for name resolution, so I'm not 100% sure yet why this happened, but I speculate it is a result of the socket library in Python, which the Cloudera SCM agent uses.
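
One way to test that theory is to ask the system resolver what name it returns for the server's IP. This is only a sketch, assuming Python 3: reverse_name and matches_expected are hypothetical helpers, the address 192.0.2.10 and the name scm-server.example.com are placeholders to replace with your own values, and whether /etc/hosts or DNS answers first depends on how the host's resolver is configured.

```python
import socket


def reverse_name(ip, lookup=socket.gethostbyaddr):
    """Return the primary host name the resolver reports for ip, or None."""
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        hostname, _aliases, _addresses = lookup(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname


def matches_expected(ip, expected_host, lookup=socket.gethostbyaddr):
    """True if the reverse lookup of ip yields expected_host (case-insensitive)."""
    found = reverse_name(ip, lookup)
    return found is not None and found.lower() == expected_host.lower()


if __name__ == "__main__":
    server_ip = "192.0.2.10"  # placeholder: the SCM server's IP address
    print(reverse_name(server_ip))
    print(matches_expected(server_ip, "scm-server.example.com"))  # placeholder name
```

If the reverse lookup prints the old host name, that would be consistent with a stale DNS record being the source of the changed server_host value.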

I resolved it by changing the server_host value back to the name of the host running the SCM server and then restarting the cloudera-scm-agent service.
