GCP — Set up alerts for particular logs
Alert when systemd timer does not trigger the systemd service
I have set up some cron jobs using systemd timers on a VM on GCP. The cron jobs are there to feed new data into our database. In the beginning, the cron jobs were running perfectly, but then, all of a sudden, they stopped running, and we only noticed when our users started complaining that they were not seeing any new data. After taking a look at the cron jobs, I saw that they had stopped working almost 2 weeks ago! What a disaster!!!
This problem needed a solution. A solution that would 🚨 alert 🚨 us whenever the cron jobs would stop running.
This blog post makes use of GCP and the following services offered by GCP: Logging, Monitoring, and Compute Engine. It also uses the logs that were generated through the Linux systemd service.
However, anyone looking to set up an alert system on GCP based on their particular logs may benefit from this blog post.
P.S.: You can easily spin up a VM on GCP and create a systemd service which just echoes something, plus a systemd timer which calls the service every 10 minutes.
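As a minimal sketch of such a pair of units (the unit name `hello-cron` and the echoed message are hypothetical, chosen just for this example):

```ini
# /etc/systemd/system/hello-cron.service (hypothetical name)
[Unit]
Description=Echo a heartbeat message

[Service]
Type=oneshot
ExecStart=/bin/echo "hello from the cron job"
```

```ini
# /etc/systemd/system/hello-cron.timer (hypothetical name)
[Unit]
Description=Run hello-cron.service every 10 minutes

[Timer]
# Fire at minute 0, 10, 20, ... of every hour
OnCalendar=*:0/10
Unit=hello-cron.service

[Install]
WantedBy=timers.target
```

After creating the files, run `sudo systemctl daemon-reload` and `sudo systemctl enable --now hello-cron.timer` to start the timer.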
GCP Logs Explorer
Capturing the logs from the VM
Before we set up the 🚨 alert system 🚨, we would have to capture logs to get the output from the systemd service. You will have to install the agent provided by GCP on your VM to collect the logs from the VM. You may install the agent by running this command in the VM:
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh --also-install
For further information on installing the agent on your VM, have a look at Google's documentation on installing the Logging agent.
NOTE: You may incur charges for collecting logs from your VM.
Once the agent is collecting logs from the VM, you can head over to the Logs Explorer by selecting the VM and then hitting the “Cloud Logging” link. This will lead you to the Logs Explorer service, where you will see all the logs from the VM.
NOTE: We will be filtering for the systemd logs stating that the systemd service completed successfully. For this, the systemd service must already have at least one successful run.
Looking at the format of the last log from the systemd service stating that it completed successfully, we can see that it has a jsonPayload.message field, which we can use to filter for all the logs from the systemd service that give us evidence that the service did, in fact, complete successfully.
NOTE: For me, the systemd service file is named remittance-cronjobs.service, for you it may be different depending on what you named your systemd service file.
Up in the query search box, add an additional filter. Your query search box would now look similar to this:
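As a sketch of what the filter could look like (the exact message text is an assumption; what systemd logs on success depends on your systemd version, so check your own logs and copy the real message):

```
resource.type="gce_instance"
jsonPayload.message:"remittance-cronjobs.service: Succeeded."
```

The `:` operator does a substring match, which is handy when the full message carries extra detail you don't care about.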
Next, hit the Run query button, and you will see all the logs, for a particular time frame, for your systemd service, standing as witnesses that the service did succeed in running.
Copy the query filters…
Now that we know how to filter our logs and get the logs that we need, let us create a metric based on our filters.
While still under Logging, head over to Logs-based Metrics.
Hit CREATE METRIC, and make sure you have these settings:
- For the Metric Type, make sure Counter has been selected.
- Under Details, type in the details 😉 (only name is necessary, description is up to you, and leave the Units field blank)
- Under Filter Selection, PASTE in the query filters (remember I had told you to copy those)
- Add Labels as you see fit
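If you prefer the command line, the same metric can be created with gcloud (a sketch; the metric name `cronjob_success` and the filter text are assumptions — swap in your own name and the filter you copied earlier):

```
# Create a Counter log-based metric from the query filter
gcloud logging metrics create cronjob_success \
  --description="Counts successful runs of the cron job's systemd service" \
  --log-filter='resource.type="gce_instance" AND jsonPayload.message:"remittance-cronjobs.service: Succeeded."'
```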
🚨 ALERT 🚨
All right!!! Now to our final stage: setting up an alert if our cron job fails or does not run!!!
As mentioned previously, when our cron job succeeds, that is, when the systemd service runs successfully (remember that the systemd timer is executing the systemd service), we should see this in our logs:
If it goes missing, then that means that the cron job did not run or it failed. In my case, I am running the cron jobs every hour, so I should expect to see those success logs in there after every hour. If I don’t see the success log in a window of 2 hours, then I want the system to alert me.
After having created our log-based metric, we would like to create an alert for it. We can do so without leaving the Logs-based Metrics page;
By clicking on Create alert from metric, we would be taken to a page similar to this:
Here, click on the Query Editor, found next to the Target headline, and add this line at the bottom:
| absent_for 2h
This basically means that the alert will fire when the metric is absent for 2 hours. Feel free to change the 2h (2 hours) to anything that fits your specific needs.
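Put together, the query in the editor might look roughly like this (a Monitoring Query Language sketch; the metric name `cronjob_success` is an assumption — use whatever you named your log-based metric):

```
fetch gce_instance
| metric 'logging.googleapis.com/user/cronjob_success'
| absent_for 2h
```

Log-based metrics you create yourself live under the `logging.googleapis.com/user/` prefix, which is why it appears in the metric path.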
Annnnnd that's it! Just save your alert, fill out all the other remaining details, and you are good to GO!