GoTel: Who Monitors the Monitors?

CrowdStrike Cloud Engineering is releasing GoTel, an internal monitoring service that aims to ensure scheduled jobs, cron jobs, batch-oriented work, and other scheduled tasks complete successfully and within a set SLA time period.

  • Provides a coordinator/worker pattern to ensure one GoTel instance is always operational
  • HTTP-based API to enable easy integration
  • Ability to pause check-ins during scheduled maintenance periods

Overview: 

GoTel is for monitoring scheduled operations.

Most companies have scheduled reports, cron jobs, backup jobs, ad hoc data-processing tasks, and similar work that is expected to run perfectly but sometimes fails silently. GoTel lets each job make a “reservation”, a commitment to check in within its allotted time frame. Each time the job runs, it checks in with GoTel, which updates the job’s last check-in time. If a job misses its window, alerts are sent out to the world.

GoTel was born from years of experience with “cron”-type jobs that suddenly stop working: a firewall config change blocks a port the job depends on, data sets grow until something that took 10 minutes takes 2 hours, something in the environment changes, or any of a myriad of other failure conditions occurs. GoTel is for when things need to run and you need an independent monitor in your network that is not locked in to a specific vendor. We’ve also seen alerting failures from third-party vendors that were supposed to give us the warm, fuzzy feeling that they would alert us when things stopped working.

As a toy example, take a nightly job that removes old data from a data store. It’s a simple one-liner that runs every night. One day you hit an inflection point where the indexes become useless and your script starts doing a full table scan. Your 20-minute cleanup job now takes 5 hours, and you don’t find out when it happens.

GoTel is for when you don’t need the overhead of an “enterprise” grade schedule monitor. It’s for the microservice world where you have apps running in various languages, platforms and locations.

You are expected to run two GoTel instances for redundancy (across datacenters). The coordinator monitors the worker and the worker monitors the coordinator, so that GoTel itself stays operational; if one instance goes down, the other sends out alerts to avoid a silent failure.

GoTel has been running inside CrowdStrike and has already caught cases in production of scheduled operations silently failing, reducing customer impact.

Follow the GoTel project on GitHub: https://github.com/CrowdStrike/gotel

 
