Check API
A Check is a runtime test that verifies something is working well. Checks are similar and complementary to PHPUnit and acceptance tests, but they form the next layer around them and run at run time rather than at development or build time.
Like other forms of testing, the tests themselves should be easy to read, to reason about, and to confirm as valid.
Many types of runtime checks cannot be unit tested, and often the checks are the test.
Checks can be used for a variety of purposes including:
- configuration checks
- security checks
- status checks
- performance checks
- health checks
Moodle has had various types of checks and reports for a long time, but they were inconsistent and not machine-readable. In Moodle 3.9 they were unified under a single Check API, which also lets plugins cleanly define their own additional checks. Some examples include:
- a password policy plugin could add a security check
- a custom authentication plugin could add a check that the upstream identity system can be connected to
- a MUC store plugin could add a performance check
Having these checks centralized and governed by a consistent contract makes it much easier to ensure the whole system is running smoothly. It also makes integration with external monitoring systems such as Nagios / Icinga possible. The Check API is exposed as an NRPE-compliant CLI script:
php admin/cli/checks.php
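
Because the script is described as NRPE compliant, a monitoring wrapper can interpret its exit code directly. The sketch below assumes the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and that the command is run from the Moodle root directory. The helper names `interpret_exit_code` and `run_checks` are hypothetical and not part of Moodle.

```python
import subprocess
from typing import Tuple

# Standard Nagios plugin exit codes. Assumed to apply to admin/cli/checks.php
# because the script is described as NRPE compliant.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_exit_code(code: int) -> str:
    """Map a plugin exit code to its Nagios state name.

    This sketch chooses to report any code outside 0-3 as UNKNOWN.
    """
    return NAGIOS_STATES.get(code, "UNKNOWN")


def run_checks(moodle_root: str = ".") -> Tuple[str, str]:
    """Run the checks script and return (state, output text). Hypothetical helper."""
    proc = subprocess.run(
        ["php", "admin/cli/checks.php"],
        cwd=moodle_root,
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(proc.returncode), proc.stdout.strip()
```

In practice NRPE or the monitoring agent performs this mapping itself; the sketch only makes the contract explicit.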
Result states of a check
Status | Meaning | Example |
---|---|---|
N/A | This check doesn't apply - but we may still want to expose the check | secure cookies setting is disabled because site is not https |
Ok | A component is configured, working and fast. | ldap can bind and return value with low latency |
Info | A component is OK, and we may want to alert the admin to something non urgent such as a deprecation, or something which needs to be checked manually. | |
Unknown | We don't yet know the state, e.g. the check may be very expensive, so it runs via the Task API and we are still waiting for the answer. NOTE: Unknown is generally a bad thing and is semantically treated as an error. It is better to report Unknown only until the first result arrives; from then on report Ok, or perhaps Warning or Error if the last known result is getting stale. If you are caching or showing a stale result, expose its timestamp in the result summary text. | A complex user security report is still running for the first time. |
Warning | Something is not ideal and should be addressed, e.g. usability or the speed of the site may be affected, but it may self-heal (e.g. after a load spike) | auth_ldap could bind but was slower than normal |
Error | Something is wrong with a component and a feature is not working | auth_ldap could not connect, so users cannot start new sessions |
Critical | An error which is affecting everyone in a major way | Cannot read site data or the database, the whole site is down |
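
The guidance on Unknown and stale results in the table above can be sketched as follows. This is an illustration in Python rather than Moodle code. The function name `effective_status`, the staleness thresholds, the summary wording, and the severity ordering (taken from the row order of the table) are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Assumed severity ordering, least to most severe, following the table rows.
SEVERITY = ["N/A", "Ok", "Info", "Unknown", "Warning", "Error", "Critical"]


def worse(a: str, b: str) -> str:
    """Return whichever of two statuses is more severe."""
    return a if SEVERITY.index(a) >= SEVERITY.index(b) else b


def effective_status(
    last_status: Optional[str],
    last_run: Optional[datetime],
    now: datetime,
    warn_after: timedelta = timedelta(hours=24),   # assumed threshold
    error_after: timedelta = timedelta(hours=72),  # assumed threshold
) -> Tuple[str, str]:
    """Return (status, summary) for a check computed in the background."""
    # No result yet: Unknown is reported only until the first run completes.
    if last_status is None or last_run is None:
        return "Unknown", "Waiting for the first result"

    # Always expose the time of the cached result in the summary text.
    summary = "Last checked at " + last_run.isoformat()
    age = now - last_run

    # A stale result escalates the status but never hides a worse one.
    if age >= error_after:
        return worse(last_status, "Error"), summary + " (result is stale)"
    if age >= warn_after:
        return worse(last_status, "Warning"), summary + " (result is getting stale)"
    return last_status, summary
```

With these defaults, an Ok result that is 30 hours old is reported as Warning, while a Critical result of the same age stays Critical.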
How the various states are then leveraged is a local decision. A typical policy might be that health checks with a status of 'Error' or 'Critical' will page a system administrator 24/7, while 'Warning' only pages during business hours.
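
The typical policy described above could be expressed as a small routing function. This is a minimal sketch of one possible local policy, not something Moodle provides. The names `should_page` and `is_business_hours` are hypothetical, and business hours are assumed to be Monday to Friday, 09:00 to 17:00 local time.

```python
from datetime import datetime

# Statuses that page an administrator at any time of day.
ALWAYS_PAGE = {"Error", "Critical"}
# Statuses that page only during business hours.
BUSINESS_HOURS_PAGE = {"Warning"}


def is_business_hours(now: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def should_page(status: str, now: datetime) -> bool:
    """Decide whether a health check status should page someone right now."""
    if status in ALWAYS_PAGE:
        return True
    if status in BUSINESS_HOURS_PAGE:
        return is_business_hours(now)
    # N/A, Ok, Info and Unknown do not page under this example policy.
    return False
```

Since Unknown is semantically treated as an error, a site might reasonably move it into one of the two paging sets; that is part of the same local decision.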