02. 05. 2023 Davide Gallo Contribution, NetEye

Using Ansible to Automate Agent Deployment

NetEye relies on many agents to monitor even a single server: Icinga, Telegraf, Elastic Beats, the GLPI agent, and so on.

As a Site Reliability Engineer, I’m responsible for ensuring that all these agents run smoothly. This can involve performing repetitive and time-consuming tasks like managing configurations, deploying updates, and provisioning new resources.

Ansible is an open-source automation tool that can automate tasks across entire IT environments, from servers and workstations to network devices and cloud services. It uses a simple, agentless approach that eliminates many of the complexities and headaches associated with traditional automation tools.

Another advantage of using Ansible is the ability to create predefined playbooks, letting you easily provision new servers, or update existing ones with a single command. This can help you save time and reduce the risk of human error, as well as ensure that your infrastructure is always up-to-date and secure.

For this reason, NetEye itself is installed, updated and upgraded with Ansible playbooks created by our R&D Team.

Use case

Suppose I’ve deployed new Fedora servers to scale our business application, and now the team has asked me to monitor the resources of these machines using Telegraf.

Step 1: Analyzing my Infrastructure

Say we have 3 Fedora servers named server1, server2, server3, a satellite named satellite.neteye that has SSH access to those servers, and I need to collect only basic metrics like disk IO, network, etc.
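
As a rough sketch, this small infrastructure could be described in a static Ansible inventory in YAML format. The file name inventory.yml and the group name telegraf_agents are assumptions made for illustration; they do not appear elsewhere in this post.

```yaml
# inventory.yml (hypothetical file name), kept on the satellite
all:
  children:
    telegraf_agents:    # hypothetical group name for the monitored servers
      hosts:
        server1:
        server2:
        server3:
```

With a file like this, a play could target the group name instead of listing every host, and new servers would only need to be added in one place.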

Step 2: Creating the Telegraf Configuration

Telegraf comes with many plugins that can be used to monitor your server. By default we use the NATS output plugin to send the metrics to the satellite. Your satellite already has the architecture in place to receive and forward Telegraf metrics.

You only need to copy the user certificates /neteye/local/telegraf/conf/certs/telegraf-agent.crt.pem and /neteye/local/telegraf/conf/certs/private/telegraf-agent.key.pem from your satellite to your target server, and create a Telegraf configuration that looks like this:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

[[outputs.nats]]
  servers = ["nats://satellite.neteye:4222"]
  subject = "telegraf.metrics"
  secure = true
  tls_cert = "/etc/telegraf/certs/telegraf-agent.crt.pem"
  tls_key = "/etc/telegraf/certs/private/telegraf-agent.key.pem"
  data_format = "influx"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
  mount_points = ["/"]
  tagexclude = ["fstype", "device"]

[[inputs.mem]]
  tagexclude = ["mode"]

[[inputs.net]]
  interfaces = ["eth0"]
  tagexclude = ["interface"]

[[inputs.system]]
  tagexclude = ["host", "kernel", "uptime"]

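Let’s save this configuration as telegraf.conf in a folder on the satellite (e.g. /root/ansible-telegraf), and place the certificates next to it in certs/ and certs/private/ subfolders, so that the layout matches the tls_cert and tls_key paths above.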
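
The staging of the certificates can itself be automated. The following is a minimal sketch, assuming it runs on the satellite against localhost and that /root/ansible-telegraf is the staging folder; the variable names staging_dir and cert_dir are hypothetical.

```yaml
---
- name: Stage the Telegraf certificates on the satellite
  hosts: localhost
  connection: local
  gather_facts: false
  become: true

  vars:
    staging_dir: /root/ansible-telegraf            # example folder from the text
    cert_dir: /neteye/local/telegraf/conf/certs    # certificate location on the satellite

  tasks:
    - name: Create the staging folders (parents are created as needed)
      ansible.builtin.file:
        path: "{{ staging_dir }}/certs/private"
        state: directory
        mode: "0700"

    - name: Stage the agent certificate
      ansible.builtin.copy:
        src: "{{ cert_dir }}/telegraf-agent.crt.pem"
        dest: "{{ staging_dir }}/certs/telegraf-agent.crt.pem"
        mode: "0644"

    - name: Stage the agent private key
      ansible.builtin.copy:
        src: "{{ cert_dir }}/private/telegraf-agent.key.pem"
        dest: "{{ staging_dir }}/certs/private/telegraf-agent.key.pem"
        mode: "0600"
```

The telegraf.conf file still has to be written into the staging folder by hand or by another task.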

Step 3: Writing the Ansible Playbook

As the title of this post suggests, we want to automate the Telegraf deployment. To do so, we need an Ansible playbook.

The playbook is the core of the Ansible solution, where you define the steps to be executed. I won’t go into details here, but it’s mainly divided into two parts:

  • Playbook definition
    You specify how the playbook has to be executed: parallelization, custom variables, target hosts, etc.
  • Playbook tasks
    You define each Ansible module that has to be executed on the targets or on localhost
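
The two parts can be illustrated with a small skeleton. This is only a sketch: the serial value and the variable telegraf_package are hypothetical choices made to show where such settings live.

```yaml
---
# Part 1: playbook definition (which hosts, how to run, which variables)
- name: Example play
  hosts: server1,server2,server3
  become: true                    # run the tasks with elevated privileges
  serial: 1                       # hypothetical choice: handle one host at a time
  vars:
    telegraf_package: telegraf    # hypothetical variable name

  # Part 2: playbook tasks (what to do on each host)
  tasks:
    - name: Install the package named in the variable
      ansible.builtin.package:
        name: "{{ telegraf_package }}"
        state: present
```

Everything above the tasks keyword belongs to the definition; each entry below it calls one module.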

For our use case, the playbook will be something like this:

---
- name: Install and configure Telegraf
  hosts: server1, server2, server3
  become: true

  tasks:
    - name: Install Telegraf
      yum:
        name: telegraf
        state: present

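    - name: Configure Telegraf
      copy:
        # The trailing slash copies the folder's contents into /etc/telegraf;
        # without it, the folder itself would land in /etc/telegraf/ansible-telegraf
        src: /root/ansible-telegraf/
        dest: /etc/telegraf
      notify: restart telegraf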

  handlers:
    - name: restart telegraf
      systemd:
        name: telegraf
        state: restarted
        enabled: true

On each of the hosts server1, server2 and server3, we will:

  • Install Telegraf via yum
  • Copy the Telegraf configuration and certs from our satellite to the target server. If the copy changes anything on the target, the handler restarts the service
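
If you want tighter control over file permissions, the copy step could be split in two. The sketch below is one possible variant, not the playbook used above; it assumes the Telegraf package creates a telegraf user and group that the service runs as, and that the staging folder keeps the certificates under certs/.

```yaml
---
- name: Configure Telegraf with explicit permissions
  hosts: server1,server2,server3
  become: true

  tasks:
    - name: Copy the Telegraf configuration
      ansible.builtin.copy:
        src: /root/ansible-telegraf/telegraf.conf
        dest: /etc/telegraf/telegraf.conf
        owner: root
        group: root
        mode: "0644"
      notify: restart telegraf

    - name: Copy the certificates, keeping the certs/ layout
      ansible.builtin.copy:
        src: /root/ansible-telegraf/certs/
        dest: /etc/telegraf/certs/
        owner: telegraf          # assumption: the service user created by the package
        group: telegraf
        mode: "0600"             # files readable only by the service user
        directory_mode: "0750"   # applied to directories the copy creates
      notify: restart telegraf

  handlers:
    - name: restart telegraf
      ansible.builtin.systemd:
        name: telegraf
        state: restarted
        enabled: true
```

Both tasks notify the same handler, so the service is restarted at most once per run even if both files change.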

Step 4: Executing the Deployment

Save the playbook as telegraf.yml. It can then be executed with the command ansible-playbook telegraf.yml -i server1,server2,server3 (an inline, comma-separated host list), or by using a dynamic inventory generated, for example, by an inventory tool like GLPI.
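
After the run, a short follow-up play can confirm that the agent is actually up. This is a minimal sketch, assuming the targets use systemd and the unit is called telegraf.service; the file name verify-telegraf.yml is hypothetical.

```yaml
---
# verify-telegraf.yml (hypothetical file name)
- name: Verify the Telegraf deployment
  hosts: server1,server2,server3
  become: true
  gather_facts: false

  tasks:
    - name: Collect the state of the services on the host
      ansible.builtin.service_facts:

    - name: Check that Telegraf is running
      ansible.builtin.assert:
        that:
          - "'telegraf.service' in ansible_facts.services"
          - "ansible_facts.services['telegraf.service'].state == 'running'"
        fail_msg: "Telegraf is not running on {{ inventory_hostname }}"
```

It is run the same way as the deployment playbook, and it fails on any host where the service is missing or stopped.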

Conclusion

By using Ansible to automate the deployment of agents like Telegraf, IT operations engineers can save significant amounts of time and effort while ensuring that all servers are monitored in a consistent and reliable manner.

With the ability to create predefined playbooks and easily provision new servers or update existing ones with a single command, Ansible can help reduce human error, and ensure that infrastructure is always up-to-date, secure and consistent.

So, it’s time to start using Ansible and reap the benefits of automation!

Davide Gallo

Site Reliability Engineer at Würth Phoenix
