Troubleshooting Common TscMon Issues and Fixes

1. TscMon service fails to start

  • Symptoms: Service won’t start, shows “failed” in systemctl or exits immediately.
  • Likely causes: Configuration syntax error, missing dependencies, corrupted binary, or permission issues.
  • Fixes:
    1. Check status and logs:

      Code

      sudo systemctl status tscmon
      sudo journalctl -u tscmon --no-pager -n 200
    2. Validate configuration file (assume /etc/tscmon/tscmon.yml):

      Code

      tscmon --config-test /etc/tscmon/tscmon.yml

      If your build has no built-in checker, lint the YAML instead:

      Code

      yamllint /etc/tscmon/tscmon.yml
    3. Verify dependencies are installed (e.g., required databases, language runtimes). Reinstall packages if needed:

      Code

      sudo apt-get install --reinstall tscmon
    4. Check file permissions and ownership for config, binaries, and data directories:

      Code

      sudo chown -R tscmon:tscmon /var/lib/tscmon /etc/tscmon
      sudo chmod -R 750 /var/lib/tscmon
    5. If the binary is corrupted, replace it with one from a verified release, then restart:

      Code

      sudo systemctl restart tscmon
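
Before replacing a suspect binary (step 5), it can help to compare its checksum with the digest published for the release you installed. The following is a minimal sketch, assuming a SHA-256 digest is available for that release. The function name `verify_sha256` and the path `/usr/bin/tscmon` in the usage comment are illustrative assumptions, not part of TscMon.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 if FILE's SHA-256 digest equals EXPECTED_HEX, non-zero otherwise.
# Hypothetical helper for illustration; not shipped with TscMon.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha256sum "$file" 2>/dev/null | cut -d' ' -f1)
    # An empty digest means the file could not be read.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (path and digest are placeholders):
#   verify_sha256 /usr/bin/tscmon "<digest from the release page>" \
#       && echo "binary matches" || echo "binary differs or is unreadable"
```

A mismatch only tells you the file differs from the reference; it does not say why, so check that you are comparing against the digest for the same version and architecture.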

2. High CPU or memory usage

  • Symptoms: TscMon process consumes excessive CPU/RAM, causing system slowdowns.
  • Likely causes: Heavy polling frequency, large number of monitored targets, memory leak, or inefficient plugin.
  • Fixes:
    1. Identify offending process threads:

      Code

      top -p $(pgrep -d, -f tscmon)
      sudo perf top -p $(pgrep -d, -f tscmon)
    2. Reduce polling frequency and batch checks in config (increase intervals, add jitter).
    3. Temporarily disable nonessential plugins/modules to isolate the culprit.
    4. Update to latest TscMon release (may include performance fixes).
    5. If a memory leak is suspected, enable core dumps and collect a heap profile (if supported). Restart the service after collecting diagnostics.
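
One way to spread checks out (step 2) is to give each target a stable start offset inside the polling window, so that not every target is polled at the same instant. The sketch below derives that offset from a checksum of the target name rather than from a random draw, which keeps it repeatable across restarts. The function name `poll_offset` is hypothetical, and how the offset is fed into TscMon's configuration depends on your setup and is not shown.

```shell
#!/bin/sh
# poll_offset TARGET WINDOW_SECONDS
# Prints a stable offset in [0, WINDOW_SECONDS) derived from the CRC of TARGET.
# Hypothetical helper for illustration; not part of TscMon.
poll_offset() {
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( crc % $2 ))
}

# Usage: delay each target's first check by its own offset within a 60 s window.
#   for host in web-1 web-2 db-1; do
#       echo "$host starts at +$(poll_offset "$host" 60)s"
#   done
```

The same target name always maps to the same offset, so load is spread without the schedule shifting on every restart.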

3. Missing or stale metrics in dashboard

  • Symptoms: Dashboard shows no data or old timestamps.
  • Likely causes: Ingestion pipeline stalled, time synchronization issues, or exporter failures.
  • Fixes:
    1. Verify TscMon is publishing metrics (check local metrics endpoint, e.g., http://localhost:9100/metrics).
    2. Inspect ingestion logs (message queue, TSDB) for errors or backpressure.
    3. Confirm system time is correct:

      Code

      timedatectl status
      sudo ntpstat || sudo systemctl restart systemd-timesyncd
    4. Check exporters on monitored hosts are reachable and running; test connectivity with curl or telnet.
    5. Clear any metric ingestion queues if they’re backed up, then restart the ingestion component.
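
To make "stale" concrete when checking the steps above, you can compare the timestamp of the most recent sample against the current time. This is a minimal sketch, assuming you can obtain the last sample time as a Unix epoch value. How you extract it from the metrics endpoint or the TSDB is not covered here, and the name `is_stale` is hypothetical.

```shell
#!/bin/sh
# is_stale LAST_SAMPLE_EPOCH MAX_AGE_SECONDS [NOW_EPOCH]
# Returns 0 (true) when the sample is older than MAX_AGE_SECONDS.
# NOW_EPOCH defaults to the current time; pass it explicitly for repeatable checks.
is_stale() {
    now=${3:-$(date +%s)}
    [ $(( now - $1 )) -gt "$2" ]
}

# Usage: first confirm the endpoint answers at all, then test a timestamp.
#   curl -fsS http://localhost:9100/metrics > /dev/null || echo "endpoint unreachable"
#   is_stale "$last_sample_epoch" 300 && echo "no fresh data for over 5 minutes"
```

If the check flips between fresh and stale on hosts with clock drift, fix time synchronization (step 3) before trusting the result.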

4. Alerts not firing or firing incorrectly

  • Symptoms: Expected alerts absent, or alerts fire too frequently/with wrong severity.
  • Likely causes: Alerting rule misconfiguration, wrong thresholds, silences/maintenance windows active, or time-window misalignment.
  • Fixes:
    1. Review alert rules for logic errors (evaluation window, aggregation, labels).
    2. Test rules locally with sample metric data (use tscmon alert-testing tool or query language).
    3. Check for active silences or muted receivers.
    4. Verify notification channels (email, PagerDuty, Slack) are configured and credentials valid.
    5. Adjust thresholds and add annotations explaining rationale.
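
When reviewing a rule's evaluation window (steps 1 and 2), it can help to replay a handful of sample values through the same logic by hand. The sketch below is not TscMon's rule engine or query language. It is a stand-alone approximation of one common rule shape, "fire only if the last N samples all exceed a threshold", and the name `would_fire` is hypothetical.

```shell
#!/bin/sh
# would_fire THRESHOLD WINDOW < samples
# Reads one numeric sample per line (oldest first) from stdin and prints
# "fire" if the last WINDOW samples all exceed THRESHOLD, otherwise "ok".
# Illustrative only; real alert rules may aggregate differently.
would_fire() {
    awk -v t="$1" -v w="$2" '
        { v[NR] = $1 + 0 }                 # store each sample as a number
        END {
            if (NR < w) { print "ok"; exit }   # not enough data to fill the window
            for (i = NR - w + 1; i <= NR; i++)
                if (v[i] <= t) { print "ok"; exit }
            print "fire"
        }'
}

# Usage: CPU samples with a threshold of 90 and a window of 3 samples.
#   printf '%s\n' 50 95 96 97 | would_fire 90 3
```

Replaying a short spike and a sustained breach through the function shows quickly whether the window is too short (noisy alerts) or too long (late alerts).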

5. Authentication or authorization failures

  • Symptoms: Users cannot log in, API calls return 401/403.
  • Likely causes: Token expiry, misconfigured OAuth/OIDC, incorrect role mappings, or LDAP issues.
  • Fixes:
    1. Check authentication provider status and logs (OIDC, LDAP).
    2. Validate client IDs, secrets, and callback URLs.
    3. Inspect role/permission mappings in TscMon config.
    4. Rotate or refresh tokens if expired; ensure time sync between systems.
    5. Test API with an admin token to isolate client vs
