Background
About a half year ago, I started using LibreNMS, a handy SNMP network device monitoring software. Since then, I use it to manage about 30 Cisco L2/L3 switches (of course, managed), all exist at a single manufacturing site.
Though I joined the site only one year ago, the site itself is decades old and network personnel wasn't always at the site. As a consequence, beyond those 30 Cisco managed switches, there are countless desktop unmanaged switches. Most of them are still 100Mbps. Note that "desktop" is just a category name, actual locations vary. Below desk, on floor, under floor, or over ceiling.
Situation in detail
Our installation of LibreNMS shows some error counts in ports like below.
I ignored them for a few months, since users didn't report issues. These errors were annoying only for me when I watch LibreNMS. But over time, I gradually noticed a several ports are frequently listed as "Errored" among over 1,000 ports.
However, default 5 minute update cycle is too short to conclude which port to investigate further. Ports once appearing as "errored" often disappears when I reload the WebUI. To prioritize which of them to be searched and replaced, I want to know total traffic and frame error counts per port for longer interval like 1-day or 1-week.
Though there are some requests found as below, not yet implemented in LibreNMS core.
https://community.librenms.org/t/port-errors-page-port-errors-vs-errors-delta/3045
https://community.librenms.org/t/generating-network-monthly-report/1383
I decided to investigate possibility by myself.
LibreNMS architecture
In a sense, LibreNMS is a human friendly viewer of SNMP information. LibreNMS poller collects SNMP information from devices every 5 minute, and then parse them and load them into SQL database (specifically, MariaDB).
Basic architecture is shown on right (image can be enalrged by clicking.)
Information (more specifically, SNMP counters like ifOutErrors and ifOutOctets) about ports are written to ports
table, but overwritten everytime poller
runs.
This behavior is understandable since LibreNMS primarily uses SQL database for real-time, up-to-date SNMP information. Histories are recorded mostly in RRD.
But for my purpose, numbers are preferred over graphs. So, keeping history of ports
table is necessary.
Options considered
Then the next problem is how to keep that history. I did some research and came up with following options.
- Writing completely separate script from LibreNMS to collect SNMP counters.
- Adding daily process to PHP code.
- Audit / history table using SQL Trigger.
- System Versioned Tables
Considering factors like below, I concluded to implement this as SQL side.
- Maintaining device list separately from LibreNMS would be cumbersome.
- LibreNMS PHP code is intended to be automatically updated, and my modification would not be suitable for all users.
- SQL trigger is executed with every update, which would be too often.
- System Versioned Tables seem not easily handle data handling across versions (my usecase requires tracking each port over days)
Actual implementation: recording part
My implementation is two-step. First, record ports
table daily, anyway. Second, analyze that kept table later by Python script.
Here's the first part.
Create ports_hist
table based on ports
as below.
CREATE TABLE ports_hist SELECT *, NOW() AS captured_at FROM ports;
Then, next thing is to record all rows of ports
into ports_hist
daily. I thought about cron
, but discovered MySQL has very nice feature for this particular situation: Event Scheduler. This feature enables to keep my implementation completely in SQL.
Since event scheduler is OFF by default, don't forget to turn it on.
# vi /etc/mysql/mariadb.conf.d/50-server.cnf
event_scheduler=ON
Then, add daily capturing configuration on MySQL.
CREATE EVENT capture_ports_hist
ON SCHEDULE EVERY 1 DAY STARTS '2024-08-02 00:31'
DO
INSERT INTO ports_hist SELECT *, NOW() FROM ports;
In the next morning, I could see data like below.
SELECT port_id, `ifOutOctets`, `ifInOctets`,captured_at
FROM ports_hist
WHERE captured_at < '2024-08-03' AND captured_at > '2024-08-02';
+---------+--------------+---------------+---------------------+
| port_id | ifOutOctets | ifInOctets | captured_at |
+---------+--------------+---------------+---------------------+
| 1 | 1243047387 | 1243047387 | 2024-08-02 00:31:00 |
| 2 | 2652452145 | 16019563426 | 2024-08-02 00:31:00 |
| 3 | 35039656033 | 132556547082 | 2024-08-02 00:31:00 |
| 4 | 839009919291 | 25483043737 | 2024-08-02 00:31:00 |
| 5 | 154018108298 | 832475169270 | 2024-08-02 00:31:00 |
+---------+-------------+---------------+---------------------+
...
Before writing Python script to analyze these data, I waited for some more days until some meaningful data piled up. So, analyzing part to be continued in the next article.
Appendix on addtional counters and DB schema (2024-10-15)
One day, I noticed some ports report ifOutDiscards
in addition to ifOutErros
observed above. Since ifOutDiscards
and some other statistics are recorded not in ports
table but in ports_statics
table, I changed my capturing event scheduler as below.
CREATE EVENT capture_ports_combined_hist
ON SCHEDULE EVERY 1 DAY STARTS '2024-09-30 00:31'
DO
INSERT INTO ports_combined_hist
SELECT a.port_id
,ifInErrors
,ifOutErrors
,ifInOctets
,ifOutOctets
,ifInNUcastPkts
,ifOutNUcastPkts
,ifInDiscards
,ifOutDiscards
,ifInUnknownProtos
,ifInBroadcastPkts
,ifOutBroadcastPkts
,ifInMulticastPkts
,ifOutMulticastPkts
,NOW() AS captured_at
FROM ports a
LEFT JOIN ports_statistics b ON a.port_id = b.port_id;
As a result, I have multiple custom tables in librenms
schema. But that causes error in LibreNMS ./validate.php
. I also think I would add a few more tables like these in future. So, I migrated custom tables to newly created additional schema netadmin_datamart
.
Also, I created a MySQL user netadmin
to do this kind of analysis with read-only privilege to librenms
and read-write to netadmin_datamart
.
I would not write details about these processes to keep my article simple.