Regular expressions offer something that automata do not: a declarative way to express the strings we want to accept. This is why we use them as the input language for our platform to process logs in many heterogeneous formats. Once we know how to extract key values from any log format using RegEx, we can start to think about how to apply it to some of the more popular log formats. In this blog post we take the most popular log formats for web servers (Apache, Nginx and IIS) and create community packs using our new RegEx field extraction, which allows you to easily analyse, understand and search logs from these platforms.

Apache & Nginx

Within Apache, the default access log entries look like this:

64.212.81.11 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

Next, I will show you how to extract values from this log so you can get a better understanding of how this technique can be used to create powerful searches across any log format. Using a third-party RegEx validator for convenience and quick reference is probably a good idea too 🙂

There are 3 valuable pieces of information that I will show you how to extract:

  • Requested URL
  • HTTP Status Code
  • Total number of bytes requested

Once we extract this information, we can assign keys to the values and create very interesting visualizations using RegEx, e.g. we can search for all URLs in the logs that produced 401/404 status codes, monitor them and troubleshoot if necessary.

Use common knowledge

HTTP status codes are composed of 3 digits only, which is a great piece of information to have in hand. We could use a simple \s\d{3} expression to extract any 3 digit number, however it can throw a false positive, because the status code comes right before the total number of bytes, which can also be 3 digits long, so we need to be careful. We also notice that the inverted commas closing the request line appear just before the HTTP status code. If this is the case, "\s\d{3} might work. Again, this would capture the inverted commas as well, which we do not intend to do, hence the final regular expression should look like this: \" (?P<httpcode>\b\d{3}), which directly corresponds to: capture any 3 digit number that comes right after inverted commas and a white space, and give it the key "httpcode".
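To see the extraction in action, here is a minimal sketch using Python's re module, purely for illustration (the platform is not involved here), applied to the sample Apache line from above:

import re

# Sample line from the Apache access log shown earlier
line = '64.212.81.11 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253'

# Capture the 3 digits that follow the closing inverted commas and a white space
httpcode_re = re.compile(r'" (?P<httpcode>\b\d{3})')

match = httpcode_re.search(line)
if match:
    print(match.group("httpcode"))  # -> 200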

List of URLs that responded with a 401 HTTP Status Code in the past 7 days

The Nginx log format is almost identical to Apache's, so you can apply the above search to your Nginx logs as well.
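Since both servers share the same basic access log layout, a single expression with three named groups can pull out all of the fields listed earlier. The following is a rough sketch in Python, with the group names url and bytes chosen here just for illustration:

import re

line = '64.212.81.11 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253'

# Requested URL, HTTP status code and total number of bytes in one pass
access_re = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) (?P<url>\S+) [^"]*" (?P<httpcode>\d{3}) (?P<bytes>\d+)')

match = access_re.search(line)
if match:
    print(match.group("url"))       # -> /twiki/bin/view/Main/DCCAndPostFix
    print(match.group("httpcode"))  # -> 200
    print(match.group("bytes"))     # -> 5253

Filtering on httpcode values of 401 or 404 then gives exactly the list of failing URLs mentioned above.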

Internet Information Services (IIS)

Within IIS, the default access log entries look like this:

192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SERVER, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,

The IIS logs are more extensive: they contain more information as well as "comma-separated" value fields, which is great from the RegEx perspective. Note that the new RegEx functionality works perfectly well with our search functions (e.g. SUM, AVERAGE…). A great example of this is the ability to calculate the average time it takes for the server to retrieve a .jpg file over a specific period of time. To get a better understanding of how this is achieved, I describe the search function used in the figure below.

RegEx Expression to extract the total time of a request

/\d{3}\.\d{1,}\.\d{1,}\.\d{1,}\, (?P<time>\d{1,})/

Search for all entries ending with .jpg

/.jpg

Final Search Function

/\d{3}\.\d{1,}\.\d{1,}\.\d{1,}\, (?P<time>\d{1,})/ AND /.jpg calculate(AVERAGE:time)

We can calculate the average time to retrieve a .jpg file from a server over time
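The same calculation can be sketched in a few lines of Python; this is just an illustration, reusing the time-extraction expression from above and a couple of made-up .jpg entries in the same IIS format:

import re

# Same expression as above: match the server IP followed by the time-taken field
time_re = re.compile(r"\d{3}\.\d{1,}\.\d{1,}\.\d{1,}, (?P<time>\d{1,})")

# Hypothetical IIS log lines, invented here purely for demonstration
lines = [
    "192.168.114.201, -, 03/20/01, 7:55:20, W3SVC2, SERVER, 172.21.13.45, 4502, 163, 3223, 200, 0, GET, /DeptLogo.gif, -,",
    "192.168.114.202, -, 03/20/01, 7:56:02, W3SVC2, SERVER, 172.21.13.45, 1200, 140, 9031, 200, 0, GET, /images/header.jpg, -,",
    "192.168.114.203, -, 03/20/01, 7:56:10, W3SVC2, SERVER, 172.21.13.45, 800, 140, 5120, 200, 0, GET, /images/footer.jpg, -,",
]

times = []
for line in lines:
    if ".jpg" not in line:          # same intent as the /.jpg search
        continue
    match = time_re.search(line)
    if match:
        times.append(int(match.group("time")))

if times:
    print(sum(times) / len(times))  # average retrieval time for .jpg requests -> 1000.0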

New Community Packs

I have created 3 new community packs that show how you can use RegEx to extract values from Apache, Nginx and IIS logs. This essentially means that you can now start to build some nice dashboards and analyse your web server logs in a lot more detail. For example, you might want to get a breakdown of URLs by response time or by request size.

You can check out these new community packs here:

Good practices for creating Community Packs

Finally, I will outline the potential issues you might encounter when creating a Community Pack and share a few hints on how to get the most out of the RegEx search functionality.

  • Keep your JSON clean – Our community packs are defined using a JSON structure. JSON files are meant to be reasonably human readable. Whatever your level of experience with the JSON format, try to keep the Community Pack JSON file clean and nicely formatted, and validate it every so often to avoid parser failures.
  • Escape special characters – when using RegEx in your JSON file, you need to escape most of the special characters. If your standard regular expression is in this form: \b\s\d{3}, you will need to write it as: \\b\\s\\d{3}
  • Master your RegEx – there is always more than just one solution… Look for the best RegEx expression and remember that the shortest isn’t necessarily always the best.
  • Look out for white spaces – when you retrieve values from the log using RegEx, make sure you drop all the white spaces at both ends of the extracted value. Why? Otherwise you might not be able to look up or compare the extracted values later on. A short sketch after this list illustrates both the escaping and the trimming points.
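As a rough illustration of the last two points (using Python here purely for demonstration; the packs themselves are just JSON):

import json

# Escaping: a regular expression written into a JSON file needs its backslashes doubled
pattern = r"\b\s\d{3}"
print(json.dumps(pattern))         # -> "\\b\\s\\d{3}"

# White space: trim the extracted value before lookups or comparisons,
# otherwise " 200 " and "200" will not compare as equal
extracted = " 200 "
print(extracted.strip() == "200")  # -> True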