Monitoring Metrics for user stalking

From CitconWiki
Revision as of 00:32, 4 October 2013 by Anhngoc.phung

Monitoring: Metrics for user stalking

Mike started by describing what he currently uses in his environment at Mercateo.

1. What Mike currently uses, and the problems he faces

  • He uses Nagios as his monitoring solution, with SNMP (Jeff confirmed that he uses Nagios too).
  • The check interval is about 5 minutes, and Nagios has to run the checks for all services in every interval. This causes performance problems, which get worse as the number of services grows, and it always does.
  • Because the check interval is 5 minutes long, a machine that reboots quickly enough between two checks can go unnoticed by Nagios, and no notification is sent!
  • Current solution: he uses check_mk, which collects all checks and ships the results to Nagios so that Nagios doesn't have to run every check itself. This takes some load off Nagios, but it did not improve things much, so he has not migrated everything to check_mk.
  • There are certainly other solutions, but we want to keep it simple with one uniform solution based on Nagios.
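
The missed-reboot problem and the idea of shipping results to Nagios can be sketched in a few lines of Python. This is only an illustration, not Mike's setup: it assumes the host's uptime in seconds is already available (for example derived from SNMP sysUpTime), flags a reboot when the uptime is shorter than the check interval, and formats the result as a Nagios PROCESS_SERVICE_CHECK_RESULT external command line, the format Nagios accepts for passive check results. The function names, the host name and the "Uptime" service description are hypothetical.

```python
import time

CHECK_INTERVAL = 5 * 60  # seconds, the interval mentioned in the notes

# Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def rebooted_since_last_check(uptime_seconds, seconds_since_last_check=CHECK_INTERVAL):
    """A host whose uptime is shorter than the time since the previous check
    must have rebooted in between, even if it is already up again."""
    return uptime_seconds < seconds_since_last_check


def passive_result_line(host, service, code, output, now=None):
    """Format one passive service check result as a Nagios external command."""
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}".format(
        timestamp, host, service, code, output)


def reboot_check(host, uptime_seconds, now=None):
    """Turn an uptime reading into a passive result for a hypothetical 'Uptime' service."""
    if rebooted_since_last_check(uptime_seconds):
        return passive_result_line(
            host, "Uptime", WARNING,
            "rebooted {}s ago".format(int(uptime_seconds)), now)
    return passive_result_line(
        host, "Uptime", OK, "up {}s".format(int(uptime_seconds)), now)
```

For example, `reboot_check("web01", 120, now=1380846720)` returns `[1380846720] PROCESS_SERVICE_CHECK_RESULT;web01;Uptime;1;rebooted 120s ago`. Comparing uptime against the interval catches a fast reboot that a simple up/down check every 5 minutes would miss.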


2. Jeff described what he has currently implemented with logstash: he uses a messaging layer (ZeroMQ) in combination with logstash, but this does not guarantee delivery, and messages can get lost because ZeroMQ does not hold them. @Carlo Bonamico: You added something on this topic but I forgot it. Can you help me fill it in?
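
One common way to cope with a transport that does not hold messages is to keep each message on the sender side until the receiver acknowledges it. The sketch below is not Jeff's implementation and does not use the ZeroMQ or logstash APIs. It is a plain-Python illustration in which `SpoolingSender` and the `transport` callable are hypothetical names, and the transport is assumed to return True when the receiver acknowledged the message and False when it was lost.

```python
from collections import OrderedDict


class SpoolingSender:
    """Keeps every message until it is acknowledged (at-least-once delivery)."""

    def __init__(self, transport):
        self.transport = transport      # callable(msg_id, payload) -> bool (acknowledged?)
        self.pending = OrderedDict()    # msg_id -> payload, oldest first
        self.next_id = 0

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload  # spool before sending
        self.flush()
        return msg_id

    def flush(self):
        """Try to (re)send everything still pending, oldest first."""
        for msg_id, payload in list(self.pending.items()):
            if self.transport(msg_id, payload):
                del self.pending[msg_id]


# Simulated lossy transport: the first attempt at message 0 is dropped.
received = []
drop = {0}


def lossy_transport(msg_id, payload):
    if msg_id in drop:
        drop.discard(msg_id)
        return False
    received.append(payload)
    return True


sender = SpoolingSender(lossy_transport)
sender.send("disk full on web01")  # lost, stays in the spool
sender.send("load high on db02")   # flush resends the first message too
```

After the second `send`, both messages have arrived and the spool is empty. The trade-off is that a lost acknowledgement leads to a duplicate, so a receiver in such a design would need to deduplicate by message id.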


3. Squirrel described what his team currently uses

  • Logentries
  • Stackdriver

4. Discussion and findings

  • Jeff and Squirrel confirmed that they use PagerDuty for EoD because it can directly alert the right person who can solve the issue.
  • We should try mCollective to aggregate Nagios checks. Anyone with experience, feel free to give an update or feedback.
  • There are also other solutions which can help, but they need to be investigated and tested.
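
Whatever tool collects the results, aggregating Nagios checks can be reduced to folding many per-host results into a single state. The sketch below does not use mCollective; it only illustrates the folding step. The severity ordering (CRITICAL above UNKNOWN above WARNING above OK) is an assumption that a team may choose differently, and the function name and host names are hypothetical.

```python
# Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed severity ranking: the higher rank wins when aggregating
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def aggregate(results):
    """Fold a dict of host -> return code into one (code, summary) pair."""
    if not results:
        return UNKNOWN, "UNKNOWN - no results received"
    worst = max(results.values(), key=lambda code: SEVERITY[code])
    if worst == OK:
        return OK, "OK - {} hosts fine".format(len(results))
    bad = sorted(host for host, code in results.items() if code != OK)
    return worst, "{} - {} of {} hosts not OK: {}".format(
        NAMES[worst], len(bad), len(results), ", ".join(bad))
```

For example, `aggregate({"web01": 0, "web02": 1, "db01": 2})` returns `(2, "CRITICAL - 2 of 3 hosts not OK: db01, web02")`, so Nagios would only need to process one result for the whole group.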