When parsing Apache web server logs on Linux, I find it interesting to monitor access requests resulting in HTTP status codes other than 200s. An HTTP status code in the 200s means the request was successful, and hey–that’s boring.
I want to see the requests that my dear Apache instance is upset about. So the question becomes…how do I filter the logs to show me every entry that doesn’t have a status code in the 200s?
Let’s back our way into this. We’ll start with the answer, then explain how we got there.
The Answer
This CLI incantation will get the job done.
sudo grep -E '\" [1345][01235][0-9] [[:digit:]]{1,8} \"' /var/log/apache2/access.log
If you’d like to watch the log entries scroll by in real time, try this.
sudo tail -f /var/log/apache2/access.log | grep -E '\" [1345][01235][0-9] [[:digit:]]{1,8} \"'
Comprehending The Regex
Let’s focus on the regular expression (regex) grep is using to find the matches. In plain English, the -E flag tells grep to treat the pattern as an extended regex, and grep displays every line in the file /var/log/apache2/access.log that matches it.
The regex portion of the command is as follows.
'\" [1345][01235][0-9] [[:digit:]]{1,8} \"'
The regex is enclosed in single quotes so that the shell passes it to grep as one argument, exactly as written. Let’s walk through the regex to see what it’s telling grep to look for.
- \" a literal quotation mark
- a space
- [1345] any of 1, 3, 4, or 5
- [01235] any of 0, 1, 2, 3, or 5
- [0-9] any single digit
- a space
- [[:digit:]]{1,8} any number from 1 to 8 digits long
- a space
- \" a literal quotation mark
Regex is a powerful tool, and there are likely other ways to get the job done. I’m showing you a way that worked for me, knowing that there are possibly more elegant ways if my regex-fu were mightier.
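As one example of another route, here is a minimal awk sketch. It is not the approach this post uses. It assumes the status code is the ninth whitespace-separated field, which holds for the common combined log format when the request line is a plain "METHOD PATH PROTOCOL". The helper name non_2xx and the sample entries are made up for illustration.

```shell
# Hypothetical helper (not the grep command from this post): keep entries whose
# status field is a three-digit number that does not start with 2.
# Assumes the status code is whitespace-separated field 9.
non_2xx() {
    awk '$9 ~ /^[0-9][0-9][0-9]$/ && $9 !~ /^2/'
}

# Made-up entries for illustration.
printf '%s\n' \
    '198.51.100.1 - - [06/Apr/2022:13:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"' \
    '198.51.100.3 - - [06/Apr/2022:13:00:02 +0000] "GET /missing HTTP/1.1" 404 497 "-" "ExampleBrowser/1.0"' |
    non_2xx
```

Only the 404 entry should come out the other end. A malformed request line with extra spaces would shift the fields, which is one reason the grep pattern anchors on the surrounding quotation marks instead.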
Why Does This Work?
To see why this regex will show us lines with non-200 status codes, let’s look at this example Apache log entry.
112.170.115.206 - - [06/Apr/2022:13:23:15 +0000] "GET /feed/ HTTP/1.1" 301 590 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
Your Apache logs might look different. Take a look at the LogFormat directives in your /etc/apache2/*.conf files; if your LogFormat is substantially different from mine, you will have to update the regex. For example, a format that logs the size with %b writes a - instead of a number when no bytes are sent, and this regex would skip those entries.
The bit we care about is the middle of the entry, where it says 301 590. The field we care about specifically is the one containing the 301–the HTTP status code for this particular response. The second number is the size of the object Apache sent back to the client. We don’t care about that number’s value, but we do care that there is a number there. That helps us be confident that the previous number is the status code we’re concerned with.
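A quick way to check this reasoning is to feed the sample entry straight to grep. The sketch below does that, alongside a made-up 200 entry for contrast (its address and user agent are invented for illustration). The regex is written here without the backslashes, on the assumption that a quotation mark needs no escaping inside single quotes; it should match the same lines.

```shell
# The sample entry from this post.
sample_301='112.170.115.206 - - [06/Apr/2022:13:23:15 +0000] "GET /feed/ HTTP/1.1" 301 590 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"'

# A made-up successful request, for contrast (hypothetical data).
sample_200='203.0.113.7 - - [06/Apr/2022:13:23:16 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"'

# Same pattern as above, minus the backslashes before the quotation marks.
regex='" [1345][01235][0-9] [[:digit:]]{1,8} "'

# Pipe both entries through grep; only non-200 entries should survive.
printf '%s\n%s\n' "$sample_301" "$sample_200" | grep -E "$regex"
```

Only the 301 entry should be printed, because the 2 at the start of 200 is not in [1345].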
Here are some other things we can assume based on what we know about HTTP status codes and the way Apache is formatting our logs.
- These two numbers are going to be preceded by a quote and a space and followed by a space and a quote.
- Status codes are defined by HTTP standards. They aren’t just anything. There are 100s, 200s, 300s, 400s, and 500s. If we want to eliminate 200s, we require the first digit to be a 1, 3, 4, or 5. By the same logic, we could use [4] if we only wanted to see 400s, or [145] if we wanted to eliminate the 200s and 300s. You get the idea.
- The middle digit of currently defined status codes is only ever 0, 1, 2, 3, or 5. So, we can insist on one of those values in the middle position.
- The last digit of currently defined status codes can be any digit. So, we care that there’s a digit in that third position, but that’s all the filtering we can do.
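To make the first-digit idea concrete, here is a small sketch that swaps the leading character class while leaving the rest of the pattern alone. The helper name status_regex and the log entries are made up for illustration, and the quotation marks are written without backslashes on the assumption that they need no escaping.

```shell
# Hypothetical helper: build the regex for a chosen set of leading status digits.
status_regex() {
    printf '" [%s][01235][0-9] [[:digit:]]{1,8} "' "$1"
}

# Made-up entries for illustration.
sample_log=$(cat <<'EOF'
198.51.100.1 - - [06/Apr/2022:13:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"
198.51.100.2 - - [06/Apr/2022:13:00:01 +0000] "GET /feed/ HTTP/1.1" 301 590 "-" "ExampleBot/1.0"
198.51.100.3 - - [06/Apr/2022:13:00:02 +0000] "GET /missing HTTP/1.1" 404 497 "-" "ExampleBrowser/1.0"
198.51.100.4 - - [06/Apr/2022:13:00:03 +0000] "POST /form HTTP/1.1" 500 812 "-" "ExampleBrowser/1.0"
EOF
)

# Only the 400s.
printf '%s\n' "$sample_log" | grep -E "$(status_regex 4)"

# Everything except the 200s and 300s.
printf '%s\n' "$sample_log" | grep -E "$(status_regex 145)"
```

The first grep should show just the 404 entry, and the second should show the 404 and 500 entries.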
Baking all of these assumptions into the regex means that we reduce the chance of the regex matching lines we don’t actually care about. Reducing false positives is important so that we can assume the log entries the grep is showing us are interesting or even actionable.
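One rough way to sanity-check what the regex is catching is to tally the matches by status code. This is only a sketch: the helper name tally_statuses and the sample entries are made up, and the pattern is written without the backslashes on the assumption that the quotation marks need no escaping.

```shell
regex='" [1345][01235][0-9] [[:digit:]]{1,8} "'

# Hypothetical helper: read log entries on stdin and count matches per status code.
tally_statuses() {
    # -o prints only the matched text (quote, status, size, quote);
    # splitting that on spaces puts the status code in field 2.
    grep -oE "$regex" | cut -d' ' -f2 | sort | uniq -c
}

# Made-up entries for illustration.
tally_statuses <<'EOF'
198.51.100.1 - - [06/Apr/2022:13:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"
198.51.100.2 - - [06/Apr/2022:13:00:01 +0000] "GET /feed/ HTTP/1.1" 301 590 "-" "ExampleBot/1.0"
198.51.100.3 - - [06/Apr/2022:13:00:02 +0000] "GET /missing HTTP/1.1" 404 497 "-" "ExampleBrowser/1.0"
198.51.100.5 - - [06/Apr/2022:13:00:04 +0000] "GET /gone HTTP/1.1" 404 497 "-" "ExampleBrowser/1.0"
EOF
```

Against the real log, the same helper could read from something like sudo cat /var/log/apache2/access.log piped into tally_statuses. A status code in the tally that looks implausible is a hint that the regex is matching something other than the status field.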
A Better Way
CLI tools are nice and so on–good for diagnostics and general neckbearding. But what you really want is a log parsing engine that ingests all your log data and summarizes things like interesting HTTP status codes for you. For instance, I’ve been messing with the free tier of Grafana Cloud lately, although I’ve only plumbed it to NGINX so far. I haven’t tried it with Apache logs yet. A project for another day.
For More Information
Regular Expressions in Grep (Regex) – Linuxize
HTTP Response Status Codes – Mozilla Developer Network
Apache 2.4 Log Files – Apache HTTP Server Project Documentation