Error Log Definition
An error log is a file that contains detailed records of the error conditions a piece of software encounters while it’s running. The name is generic: sometimes, an application also logs non-error messages in its error log. However, error logs are generally meant to record only the error messages generated by a program. These programs can be server or network operating systems or third-party applications. In this article, we will focus on application error logs only.
When an application experiences an outage or performance problem, the quickest way to find the root cause is to look for its error log and check the error messages. High-quality error logs provide enough information to begin troubleshooting. They tell you what happened, when it happened, how critical it was, and maybe even the name of the offending module or a trace ID to correlate with other events.
Error logs are invaluable for powering traditional monitoring and security information and event management (SIEM) systems. These monitoring solutions can parse and identify critical errors from the logs, show historical trends of similar errors, and send alerts for security-related incidents.
This article will show you what a typical application error log contains, how it can support your operations team, and how to get the maximum value from it.
What Do Error Logs Contain?
Error logs can contain two types of application errors: unhandled error messages and custom error messages. An unhandled error message, also called an untrapped message, comes from an error that the developer’s code does not catch. Instead, the error is thrown by one of the application’s libraries or by its runtime. These components are not written by the application developer; they are typically linked in at build time or loaded when the program runs. Examples of such unhandled errors include a variable type mismatch or a division by zero.
On the other hand, custom error messages are logged by exception handlers in the program code. These are error conditions the developer has anticipated and written code for. For example, a banking application may log an error when the user tries to withdraw more than the current balance, and the message will likely be more human-readable.
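The difference between the two kinds of message can be sketched in Python. This is a minimal illustration rather than a recommended design: the logger name `bank_app`, the `InsufficientFundsError` class, and the `withdraw` and `log_unhandled` functions are all hypothetical, and an in-memory buffer stands in for the error log file.

```python
import io
import logging
import sys

# In-memory buffer standing in for the error log file (assumption for the demo).
log_buffer = io.StringIO()
logger = logging.getLogger("bank_app")  # hypothetical application name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)


class InsufficientFundsError(Exception):
    """An error condition the developer anticipated."""


def withdraw(balance: float, amount: float) -> float:
    """Return the new balance, or raise if the amount exceeds the balance."""
    if amount > balance:
        raise InsufficientFundsError(
            f"Withdrawal of {amount:.2f} exceeds balance of {balance:.2f}"
        )
    return balance - amount


def log_unhandled(exc_type, exc_value, exc_traceback):
    """Last-resort hook: record an error that no handler in the code caught."""
    logger.critical(
        "Unhandled %s: %s",
        exc_type.__name__,
        exc_value,
        exc_info=(exc_type, exc_value, exc_traceback),
    )


# A real program could install the hook for uncaught exceptions like this:
# sys.excepthook = log_unhandled

# Custom error message: anticipated and handled by the developer's code.
try:
    withdraw(balance=100.0, amount=250.0)
except InsufficientFundsError as exc:
    logger.error("Withdrawal rejected: %s", exc)

# Unhandled error: simulate what the hook would receive for a divide by zero.
try:
    1 / 0
except ZeroDivisionError:
    log_unhandled(*sys.exc_info())

print(log_buffer.getvalue())
```

The first entry is short and human-readable because the developer wrote it; the second carries the runtime's own message and a traceback.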
Whether or not an error log is helpful depends on the level of information it records for each error event. Without sufficient details, it’s difficult to take remedial action. Different applications will provide varying details for their errors, but well-designed error log entries are structured and have some common fields:
The timestamp field shows the date and time the error occurred. Ideally, it should also include the time zone, which is especially useful for distributed systems whose components run in different regions. The ISO 8601 standard is a commonly adopted format. It looks like this:
2022-04-15T14:19:10+00:00
2022-04-15T14:20:10+00:00
…
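As a small sketch of how such timestamps can be produced, the following uses only Python's standard `datetime` module. The helper name `iso_timestamp` is an assumption made for this example.

```python
from datetime import datetime, timedelta, timezone


def iso_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as ISO 8601 with a UTC offset."""
    if moment.tzinfo is None:
        # A naive datetime would silently drop the time zone from the log.
        raise ValueError("timestamp must be timezone-aware")
    return moment.isoformat(timespec="seconds")


event_time = datetime(2022, 4, 15, 14, 19, 10, tzinfo=timezone.utc)
print(iso_timestamp(event_time))  # 2022-04-15T14:19:10+00:00

# The same instant as recorded by a host running at UTC+10:00.
local_time = event_time.astimezone(timezone(timedelta(hours=10)))
print(iso_timestamp(local_time))  # 2022-04-16T00:19:10+10:00
```

Because both strings carry their offset, entries written by hosts in different time zones can still be ordered correctly.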
Most log entries include a level that identifies the criticality of the event and indicates whether immediate attention is necessary. Commonly used levels are TRACE, DEBUG, INFO, WARN, ERROR, and FATAL, with TRACE being the least important and FATAL being the most critical. Monitoring systems can automatically send alerts based on these criticality levels and trigger automated actions.
For example, a volume running out of disk space can be logged with a level of WARN. When a monitoring solution finds this error, it can send an alert message to the operations team and then create a support ticket. On the other hand, a completely full disk may be logged as FATAL (some logging frameworks call this level CRITICAL), with the monitoring solution automatically extending the volume.
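The disk space example can be sketched as an ordered set of levels plus a mapping from level to action. The thresholds (85% and 100%) and the action names are assumptions chosen for illustration, and the fully saturated case is mapped to FATAL here; they do not come from any particular monitoring product.

```python
from enum import IntEnum


class Level(IntEnum):
    """The six levels named above, ordered from least to most critical."""
    TRACE = 0
    DEBUG = 1
    INFO = 2
    WARN = 3
    ERROR = 4
    FATAL = 5


# Hypothetical thresholds for this sketch.
WARN_AT_PERCENT = 85.0
FATAL_AT_PERCENT = 100.0


def disk_level(used_percent: float) -> Level:
    """Pick a log level for a volume based on how full it is."""
    if used_percent >= FATAL_AT_PERCENT:
        return Level.FATAL
    if used_percent >= WARN_AT_PERCENT:
        return Level.WARN
    return Level.INFO


def actions_for(level: Level) -> list:
    """Return the (hypothetical) actions a monitoring system would take."""
    if level >= Level.FATAL:
        return ["send_alert", "extend_volume"]
    if level >= Level.WARN:
        return ["send_alert", "create_ticket"]
    return []


for used in (40.0, 90.0, 100.0):
    level = disk_level(used)
    print(f"{used:5.1f}% -> {level.name}: {actions_for(level)}")
```

Using an ordered enumeration means a rule such as "WARN and above" is a single comparison rather than a list of names.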
The username field shows the network username associated with the error. Typically, it’s an action by this user that caused the error. Usernames can be helpful for troubleshooting or conducting historical analysis. For example, you could analyze error logs to identify whether some users are experiencing more errors than others. However, not every log event has a user associated with it.
The description is a brief explanation of the error. Error logs are often accessed in time-sensitive situations, so it’s essential that the error description is succinct but still provides the necessary information. For example, when a user can’t access an application, a bare description like “Access Denied” isn’t really helpful. A better description would be “Access Denied: Insufficient Privileges.”
Besides the common fields, other attributes you’ll often see in error logs may include:
- Error IDs: Error IDs are used to uniquely identify each type of error.
- IP Addresses: Some error messages show the IP addresses of the source and destination devices.
- Device or Server: This can be the device’s network name or IP address where the application threw an error.
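One way to produce structured entries with these fields is a JSON formatter for Python's standard `logging` module. This is a sketch under assumptions: the JSON key names, the `JsonFormatter` class, and the sample values (`jdoe`, `AUTH-403`, `app-01`, and a documentation-range IP address) are invented for the example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="seconds"),
            "level": record.levelname,
            "description": record.getMessage(),
            # Optional fields supplied through `extra=`; key names are illustrative.
            "user": getattr(record, "user", None),
            "error_id": getattr(record, "error_id", None),
            "source_ip": getattr(record, "source_ip", None),
            "host": getattr(record, "host", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for the error log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "Access Denied: Insufficient Privileges",
    extra={
        "user": "jdoe",
        "error_id": "AUTH-403",
        "source_ip": "192.0.2.10",
        "host": "app-01",
    },
)

entry = json.loads(stream.getvalue())
print(entry["level"], "-", entry["description"])
```

Fields that do not apply to an event, such as the user, are simply written as `null`, so every entry keeps the same shape for downstream parsers.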
Error Log Benefits
Error logs are like the black box of an airplane: they are the first port of call for most support engineers. Here are some of their benefits:
Improved Resolution Times
Error logs help reduce the mean time to resolution (MTTR) of your IT environment, especially when ingested into a modern log management system. These log management solutions allow you to filter, search, and find the errors you are interested in, drill down on specific field values, correlate events between multiple error logs, and predict possible future issues. All of this can lead to proactive measures that further reduce the chance of downtime.
Dashboards, trend charts, top N errors by importance, and various reports from log management solutions can help you identify which errors are critical, how the affected systems are performing, and if they are worth looking at immediately. Similarly, patterns in the error logs can indicate hidden problems and enable teams to take quick, proactive steps, preventing customer complaints.
Error logs can highlight application performance issues. You can identify when your application hangs, runs into memory issues, or has a low throughput. Analyzing logs over time can unearth common circumstances of performance bottlenecks. For example, examining an error log containing abandoned shopping cart events may show your ecommerce application is experiencing performance issues during Black Friday sales. This, in turn, can prompt you to scale up your infrastructure during those periods only.
Error logs are crucial for troubleshooting security incidents. Analyzing historical records of security-related errors can help you separate normal from suspicious behavior. For example, if you find an application has a consistent pattern of multiple failed login attempts across several user accounts, you may want to find out whether those users are still active and ask them to use more secure passwords or even two-factor authentication. With a security orchestration, automation, and response (SOAR) tool, you can take this further by adding an automated action that disables the accounts.
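The failed login pattern described above can be detected with a simple grouping step once the log events are parsed. The sketch below assumes events arrive as `(source_ip, username, outcome)` tuples and groups failures by source address; the tuple layout, the outcome strings, and the threshold of three accounts are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical parsed events: (source_ip, username, outcome).
events = [
    ("203.0.113.7", "alice", "LOGIN_FAILED"),
    ("203.0.113.7", "bob", "LOGIN_FAILED"),
    ("203.0.113.7", "carol", "LOGIN_FAILED"),
    ("198.51.100.4", "dave", "LOGIN_FAILED"),
    ("198.51.100.4", "dave", "LOGIN_OK"),
]


def suspicious_sources(events, min_accounts=3):
    """Return sources with failed logins against at least `min_accounts` users."""
    targets = defaultdict(set)
    for source_ip, username, outcome in events:
        if outcome == "LOGIN_FAILED":
            targets[source_ip].add(username)
    return {
        ip: sorted(users)
        for ip, users in targets.items()
        if len(users) >= min_accounts
    }


print(suspicious_sources(events))
# {'203.0.113.7': ['alice', 'bob', 'carol']}
```

A single user mistyping a password, like `dave` here, stays below the threshold, while one source failing against many accounts is reported.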
Getting the Most Out of Error Logs
Despite the obvious benefits, error logs can often be highly verbose. Ingesting, parsing, and indexing many of these events can be time-consuming for log management systems. That’s why it’s best to follow some general principles to get the most out of error logs.
It’s important to filter out unnecessary events from the error log and capture only those you are interested in. You can do this by configuring the application to log only certain types of events, or only those at or above a particular criticality level. Another option is to filter at the forwarding stage, selecting only the relevant events and sending those to the log management system.
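Both kinds of filtering can be sketched with Python's standard `logging` module: a handler level keeps only events at or above a chosen criticality, and a filter drops events by content. The logger name `orders`, the `DropHealthChecks` filter, and the sample messages are hypothetical, and a buffer stands in for the log management system.

```python
import io
import logging

forwarded = io.StringIO()  # stand-in for the log management system
handler = logging.StreamHandler(forwarded)
handler.setLevel(logging.ERROR)  # forward ERROR and above only
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))


class DropHealthChecks(logging.Filter):
    """Discard noisy events that nobody needs to act on (illustrative rule)."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "health check" not in record.getMessage()


handler.addFilter(DropHealthChecks())

logger = logging.getLogger("orders")  # hypothetical application component
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.info("Order 1001 created")                       # below threshold
logger.warning("Retrying payment gateway")              # below threshold
logger.error("health check failed for replica 2")       # dropped by filter
logger.error("Payment gateway timeout")                 # forwarded
logger.critical("Database connection pool exhausted")   # forwarded

print(forwarded.getvalue())
```

Note that Python's built-in level names are WARNING and CRITICAL rather than WARN and FATAL; other frameworks use different names for the same idea.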
Decide What to Do With Events
Use the criticality level to decide the type of action to take. For example, you may want to trigger alerts and automated actions for CRITICAL or FATAL errors and create problem tickets for anything with a WARN level. To define such activities, you need feedback from your business and application owners, technical leads, and operations teams.
Alert Only Relevant Teams
Once you know which events you want to capture and what you want to do with them, make sure the relevant teams receive the alerts. Sending alerts to everyone unnecessarily can cause alert fatigue, which may result in important events being missed. For example, the infrastructure team should receive storage-related errors, while the SecOps team should receive security-related errors.
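A routing rule of this kind can be as small as a lookup table. In the sketch below, the event dictionary layout, the category names, the team names, and the fallback team are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical mapping from error category to the owning team.
ROUTES = {
    "storage": "infrastructure",
    "security": "secops",
}
DEFAULT_TEAM = "application-support"  # assumed fallback owner
ALERT_LEVELS = {"ERROR", "FATAL", "CRITICAL"}


def team_for(event: dict) -> Optional[str]:
    """Return the team to alert, or None if the event should not alert anyone."""
    if event["level"] not in ALERT_LEVELS:
        return None
    return ROUTES.get(event.get("category"), DEFAULT_TEAM)


sample_events = [
    {"level": "ERROR", "category": "storage", "description": "Volume full"},
    {"level": "FATAL", "category": "security", "description": "Token store breached"},
    {"level": "INFO", "category": "storage", "description": "Snapshot completed"},
    {"level": "ERROR", "category": "billing", "description": "Invoice job failed"},
]

for event in sample_events:
    print(event["description"], "->", team_for(event))
```

Returning `None` for low-criticality events keeps them out of everyone's inbox, which is the point of alerting only the relevant teams.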
Analyze Errors Over Time
Even when you are not troubleshooting an issue, you can look at historical error trends for anomalies and compare them with similar periods in the past. These trends can be useful for baselining and benchmarking. For example, if performance-related errors start creeping up once CPU usage exceeds 80% or the API client connection rate goes beyond 50 per second, you can treat those figures as your baseline. You can then use them for benchmarking when adding more infrastructure capacity.
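Comparing current error counts with a past period can be sketched as follows. The hourly bucketing, the historical counts, and the rule "more than twice the historical mean" are assumptions for this example; a log management system would normally provide this kind of trend analysis itself.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def errors_per_hour(timestamps):
    """Count error events per hour from ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return counts


def anomalous_hours(counts, baseline_mean, factor=2.0):
    """Return the hours whose error count exceeds `factor` times the baseline."""
    return sorted(hour for hour, n in counts.items() if n > factor * baseline_mean)


# Hypothetical hourly error counts from a comparable period in the past.
history = [4, 5, 6, 5]
baseline = mean(history)

# Hypothetical current events: 3 errors in one hour, 12 in the next.
timestamps = ["2022-04-15T14:05:00+00:00"] * 3 + ["2022-04-15T15:10:00+00:00"] * 12
counts = errors_per_hour(timestamps)
spikes = anomalous_hours(counts, baseline)

for hour in spikes:
    print(hour.isoformat(), counts[hour])
```

With a baseline of five errors per hour, only the 15:00 hour with twelve errors is reported.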
Make Alerts Actionable
Alerts triggered from errors should always have a clear, agreed-upon action plan. A RACI matrix is a valuable tool to identify the key players when alerts are fired. Similarly, playbooks are useful for automated actions. Operations teams should leverage automation features of their log management solutions to speed up response times and improve the quality of service.