I checked the issue with Wayne and found that there was an API compatibility issue between the monitoring agent and Kafka: a "version" parameter is required in the kafka output config file so that the agent uses a protocol version the broker supports.
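For reference, a minimal sketch of what the kafka output file looks like with the version parameter added; the 0.10.1.0 value below is an assumption based on the spotify/kafka container used in the POC, so match it to whatever your broker actually runs:
[[outputs.kafka]]
  ## URLs of kafka brokers
  brokers = ["x.x.x.x:9092"]
  ## Kafka topic for producer messages
  topic = "telegraf"
  ## Kafka protocol version the client should speak; must be compatible with
  ## the broker (0.10.1.0 assumes the spotify/kafka POC container)
  version = "0.10.1.0"
  max_retry = 3
  data_format = "json"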
Original Message:
Sent: 07-03-2021 12:16
From: Wayne Lee
Subject: Difficulties standing up the 128T-monitoring POC
Hi,
Have you solved the problem?
I have the same problem after upgrading the router to 5.1.3.
ELK is able to receive logs from routers running version 4.5.5, but not 5.1.3.
Thanks,
Wayne
------------------------------
Wayne Lee
Network Engineer
Hong Kong
(852) 2138 9388
Original Message:
Sent: 04-30-2021 10:28
From: Ryan Sitzman
Subject: Difficulties standing up the 128T-monitoring POC
I agree, it looks like things on the router are configured correctly and you have connectivity to the kafka broker.
I'm not super familiar with kafka, but could you be reaching a connection limit? You could try disabling all but one of the monitoring agent inputs and see if that improves service stability.
You could also try increasing the push-interval in your config.yaml to a larger value, maybe 60 seconds. That should help keep the agents from hammering on it.
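For example, a minimal sketch of the relevant fields in /etc/128t-monitoring/config.yaml, leaving everything else as-is:
# /etc/128t-monitoring/config.yaml (excerpt)
sample-interval: 5   # how often the inputs collect data
push-interval: 60    # flush to the kafka output once a minute instead of every 10 seconds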
------------------------------
Ryan Sitzman
Systems Engineer
WA
Original Message:
Sent: 04-28-2021 14:32
From: Chris Delaney
Subject: Difficulties standing up the 128T-monitoring POC
Thanks! The journalctl output reports that it can't contact any Kafka brokers; however, I know the traffic is routing, and if I kill the kafka container to free up the port I can successfully netcat on port 9092 between the two hosts.
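(For reference, the connectivity check was along these lines; exact flags vary between netcat flavors, so treat this as a sketch:)
# on the docker host, with the kafka container stopped so port 9092 is free:
nc -l 9092
# on the router, connect to the docker host's private IP:
nc -v x.x.x.x 9092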
I'd say ~90% of my files are out-of-the-box from the POC repo at this point, but I believe the pertinent config files would be:
Docker host
[working dir]/kafka.env
ADVERTISED_HOST=x.x.x.x (private IP of host)
running containers:
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[redacted] spotify/kafka "supervisord -n" 44 hours ago Up 1 second kafka
[redacted] docker.elastic.co/kibana/kibana:7.5.2 "/usr/local/bin/dumb…" 3 weeks ago Up 30 minutes kibana
[redacted] docker.elastic.co/elasticsearch/elasticsearch:7.5.2 "/usr/local/bin/dock…" 3 weeks ago Up 30 minutes elasticsearch
[redacted] docker.elastic.co/logstash/logstash:7.5.2 "/usr/local/bin/dock…" 3 weeks ago Up 1 second kafka-logstash
listening ports:
# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1037/sshd
tcp 0 0 0.0.0.0:5601 0.0.0.0:* LISTEN 1375/node
tcp6 0 0 :::22 :::* LISTEN 1037/sshd
tcp6 0 0 :::9600 :::* LISTEN 3007/java
tcp6 0 0 :::9092 :::* LISTEN 2348/java
tcp6 0 0 :::2181 :::* LISTEN 2022/java
tcp6 0 0 :::41423 :::* LISTEN 2348/java
tcp6 0 0 :::9200 :::* LISTEN 1335/java
tcp6 0 0 :::9300 :::* LISTEN 1335/java
Router
/etc/128t-monitoring/config.yaml
name: router03_ZTP
enabled: true
tags:
  - key: router
    value: ${ROUTER}
sample-interval: 5
push-interval: 10
inputs:
  - name: t128_arp_state
  - name: t128_device_state
  - name: t128_events
  - name: t128_graphql
  - name: t128_lte_metric
  - name: t128_metrics
  - name: t128_peer_path
  - name: t128_top_analytics
outputs:
  - name: kafka
/var/lib/128t-monitoring/outputs/kafka.conf
[[outputs.kafka]]
## URLs of kafka brokers
brokers = ["x.x.x.x:9092"] (private IP of docker host; matches IP of Docker host kafka.env)
## Kafka topic for producer messages
topic = "telegraf"
max_retry = 3
data_format = "json"
The journal entries for the specified service (all the rest are the same gist):
Apr 28 17:56:49 router03 systemd[1]: Started 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Starting Telegraf 1.17.4
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Loaded inputs: t128_metrics
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Loaded aggregators:
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Loaded processors:
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Loaded outputs: kafka
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! Tags enabled: host=router03 router=Router03
Apr 28 17:56:49 router03 telegraf[16886]: 2021-04-28T17:56:49Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"router03", Flush Interval:10s
Apr 28 17:56:50 router03 telegraf[16886]: 2021-04-28T17:56:50Z E! [telegraf] Error running agent: could not initialize output kafka: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Apr 28 17:56:50 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service: main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:50 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:50 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Apr 28 17:56:50 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service holdoff time over, scheduling restart.
Apr 28 17:56:50 router03 systemd[1]: Stopped 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:50 router03 systemd[1]: Started 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Starting Telegraf 1.17.4
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Loaded inputs: t128_metrics
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Loaded aggregators:
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Loaded processors:
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Loaded outputs: kafka
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! Tags enabled: host=router03 router=Router03
Apr 28 17:56:50 router03 telegraf[17019]: 2021-04-28T17:56:50Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"router03", Flush Interval:10s
Apr 28 17:56:51 router03 telegraf[17019]: 2021-04-28T17:56:51Z E! [telegraf] Error running agent: could not initialize output kafka: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Apr 28 17:56:51 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service: main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:51 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:51 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Apr 28 17:56:51 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service holdoff time over, scheduling restart.
Apr 28 17:56:51 router03 systemd[1]: Stopped 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:51 router03 systemd[1]: Started 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Starting Telegraf 1.17.4
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Loaded inputs: t128_metrics
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Loaded aggregators:
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Loaded processors:
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Loaded outputs: kafka
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! Tags enabled: host=router03 router=Router03
Apr 28 17:56:51 router03 telegraf[17087]: 2021-04-28T17:56:51Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"router03", Flush Interval:10s
Apr 28 17:56:52 router03 telegraf[17087]: 2021-04-28T17:56:52Z E! [telegraf] Error running agent: could not initialize output kafka: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Apr 28 17:56:52 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service: main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:52 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:52 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Apr 28 17:56:52 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service holdoff time over, scheduling restart.
Apr 28 17:56:52 router03 systemd[1]: Stopped 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:52 router03 systemd[1]: Started 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Starting Telegraf 1.17.4
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Loaded inputs: t128_metrics
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Loaded aggregators:
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Loaded processors:
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Loaded outputs: kafka
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! Tags enabled: host=router03 router=Router03
Apr 28 17:56:52 router03 telegraf[17156]: 2021-04-28T17:56:52Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"router03", Flush Interval:10s
Apr 28 17:56:53 router03 telegraf[17156]: 2021-04-28T17:56:53Z E! [telegraf] Error running agent: could not initialize output kafka: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Apr 28 17:56:53 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service: main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:53 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:53 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Apr 28 17:56:53 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service holdoff time over, scheduling restart.
Apr 28 17:56:53 router03 systemd[1]: Stopped 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:53 router03 systemd[1]: Started 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Starting Telegraf 1.17.4
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Loaded inputs: t128_metrics
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Loaded aggregators:
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Loaded processors:
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Loaded outputs: kafka
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! Tags enabled: host=router03 router=Router03
Apr 28 17:56:53 router03 telegraf[17228]: 2021-04-28T17:56:53Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"router03", Flush Interval:10s
Apr 28 17:56:54 router03 telegraf[17228]: 2021-04-28T17:56:54Z E! [telegraf] Error running agent: could not initialize output kafka: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Apr 28 17:56:54 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service: main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:54 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:54 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Apr 28 17:56:54 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service holdoff time over, scheduling restart.
Apr 28 17:56:54 router03 systemd[1]: Stopped 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:54 router03 systemd[1]: start request repeated too quickly for 128T-telegraf@router03_ZTP-t128_metrics.service
Apr 28 17:56:54 router03 systemd[1]: Failed to start 128T telegraf service for router03_ZTP/t128_metrics.
Apr 28 17:56:54 router03 systemd[1]: Unit 128T-telegraf@router03_ZTP-t128_metrics.service entered failed state.
Apr 28 17:56:54 router03 systemd[1]: 128T-telegraf@router03_ZTP-t128_metrics.service failed.
Kafka is running, based on the fact that I can exec into the container and get data back:
# /opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --list --zookeeper localhost
__consumer_offsets
telegraf
And I can create a session manually:
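(For illustration, a manual producer session against the POC container would look roughly like the following; the script path mirrors the kafka-topics.sh example above:)
# /opt/kafka_2.11-0.10.1.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic telegraf
> {"test": "message"}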
I'm sure whatever I'm missing is something obvious, but I have no idea what it could be!!
------------------------------
Chris Delaney
Lynchburg VA
Original Message:
Sent: 04-28-2021 11:18
From: Ryan Sitzman
Subject: Difficulties standing up the 128T-monitoring POC
Hey Chris,
The monitoring agent config can definitely be tricky to get right, but it sounds like you're 90% there!
Try checking the journal for any clues as to why the service is failing. Something like:
journalctl -fu 128T-telegraf@router03_ZTP-t128_metrics.service
The contents of your config.yaml and any journal output from the services will be helpful for further troubleshooting.
-Ryan
------------------------------
Ryan Sitzman
Systems Engineer
WA
Original Message:
Sent: 04-26-2021 20:08
From: Chris Delaney
Subject: Difficulties standing up the 128T-monitoring POC
I want to get an idea of the components behind the monitoring agent and am attempting to set up the POC environment using a variety of sources such as the 128T GitHub repo, the 128T Monitoring Agent doc, and a 128T blog from last year, but am having difficulty getting the agents to export data to the monitoring stack.
Specifically, the plugin configuration section of the 128T Docs page indicates that I should be able to use the Conductor UI to select the desired inputs, but I don't have "monitoring" under "authority", nor do I have it listed as an available plugin to add. Additionally, the command "configure authority monitoring" returns "Command 'monitoring' not found". I'm running version 4.5, and I did see the note under the Installation section indicating that version 5.1.0 is required, so I realize this may not be applicable to me. The next line down indicates that it can be installed manually as long as I'm above 4.1, however, which is what I've done.
I've set up the agent to use the sample inputs (copied to the /var/lib/128t-monitoring/inputs directory) and ship them to my kafka output. The configs pass validation (monitoring-agent-cli validate), but when I start the daemons with monitoring-agent-cli configure they all fail within seconds:
$ systemctl list-units 128T-telegraf*
UNIT                                                    LOAD   ACTIVE SUB    DESCRIPTION
● 128T-telegraf@router03_ZTP-t128_arp_state.service     loaded failed failed 128T telegraf service for router03_ZTP/t128_arp_state
● 128T-telegraf@router03_ZTP-t128_device_state.service  loaded failed failed 128T telegraf service for router03_ZTP/t128_device_state
● 128T-telegraf@router03_ZTP-t128_events.service        loaded failed failed 128T telegraf service for router03_ZTP/t128_events
● 128T-telegraf@router03_ZTP-t128_graphql.service       loaded failed failed 128T telegraf service for router03_ZTP/t128_graphql
● 128T-telegraf@router03_ZTP-t128_lte_metric.service    loaded failed failed 128T telegraf service for router03_ZTP/t128_lte_metric
● 128T-telegraf@router03_ZTP-t128_metrics.service       loaded failed failed 128T telegraf service for router03_ZTP/t128_metrics
● 128T-telegraf@router03_ZTP-t128_peer_path.service     loaded failed failed 128T telegraf service for router03_ZTP/t128_peer_path
● 128T-telegraf@router03_ZTP-t128_top_analytics.service loaded failed failed 128T telegraf service for router03_ZTP/t128_top_analytics
I see traffic coming from the agent to my monitoring VM as the processes start up, so they seem to be communicating initially, but then they fail. I can manually connect between the agent hosts and my monitor host using netcat on port 9092, however, so I don't believe it to be a connectivity issue. The Kibana dashboard is up, and Kafka is listening on 9092.
I would very much welcome any suggestions from anyone familiar with this POC environment!
------------------------------
Chris Delaney
Lynchburg VA
------------------------------