Webalizer - Web Server Statistics (webalizer.conf)





Webalizer is a program that reads web server logs and produces detailed reports. The generated pages show the number of hits your server received, the pages clients visited and much more. It is a great tool for understanding what people are looking for on your site and how you can improve your content.

The Webalizer is a fast, free web server log file analysis program. It produces highly detailed, easily configurable usage reports in HTML format, for viewing with a standard web browser.





Can you show me an example of the report?

Sure. You can see an example report of what the Calomel.org config will generate here: Webalizer Example Report.





Looking at the webalizer.conf configuration

The configuration file tells Webalizer what kind of reports we want generated. It specifies where the resulting HTML pages will go and in what format. In this exercise we will be asking Webalizer for a general report for a site with the host name YOUR_HOSTNAME.com. Every report type is enabled in the example config file so you can get an idea of what you want to use.

NOTE: as of OpenBSD 4.3, packages and the ports collection now contain Webalizer Xtended (RB21). This version adds extra reporting functions over standard Webalizer. If you have not already, we highly suggest upgrading to Webalizer Xtended.

Please take some time and look over the config file now. Make sure to change the example host name YOUR_HOSTNAME.com to your real host name. All of the directives are fully commented.

Download the config by doing a "save as" on the link, or by clicking on the link and choosing download: calomel.org webalizer.conf. Before using the configuration file, take a look at it below or download it and look through the options.

#######################################################
###  Calomel.org  webalizer.conf   BEGIN
#######################################################
# LogFile defines the web server log file to use.  If not specified
# here or on the command line, input will default to STDIN.  If
# the log filename ends in '.gz' (ie: a gzip compressed file), it will
# be decompressed on the fly as it is being read.

LogFile        /var/log/WEB_LOGS/access.log

# LogType defines the log type being processed.  Normally, the Webalizer
# expects a CLF or Combined web server log as input.  Using this option,
# you can process ftp logs as well (xferlog as produced by wu-ftp and
# others), or Squid native logs.  Values can be 'clf', 'ftp' or 'squid',
# with 'clf' the default.

LogType clf

# OutputDir is where you want to put the output files.  This should
# be a full path name, however relative ones might work as well.
# If no output directory is specified, the current directory will be used.

OutputDir      /var/www/htdocs/webalizer

# HistoryName allows you to specify the name of the history file produced
# by the Webalizer.  The history file keeps the data for up to 12 months
# worth of logs, used for generating the main HTML page (index.html).
# The default is a file named "webalizer.hist", stored in the specified
# output directory.  If you specify just the filename (without a path),
# it will be kept in the specified output directory.  Otherwise, the path
# is relative to the output directory, unless absolute (leading /).

HistoryName     webalizer.hist

# Incremental processing allows multiple partial log files to be used
# instead of one huge one.  Useful for large sites that have to rotate
# their log files more than once a month.  The Webalizer will save its
# internal state before exiting, and restore it the next time run, in
# order to continue processing where it left off.  This mode also causes
# The Webalizer to scan for and ignore duplicate records (records already
# processed by a previous run).  See the README file for additional
# information.  The value may be 'yes' or 'no', with a default of 'no'.
# The file 'webalizer.current' is used to store the current state data,
# and is located in the output directory of the program (unless changed
# with the IncrementalName option below).  Please read at least the section
# on Incremental processing in the README file before you enable this option.

Incremental    yes

# IncrementalName allows you to specify the filename for saving the
# incremental data in.  It is similar to the HistoryName option where the
# name is relative to the specified output directory, unless an absolute
# filename is specified.  The default is a file named "webalizer.current"
# kept in the normal output directory.  If you don't specify "Incremental"
# as 'yes' then this option has no meaning.

IncrementalName        webalizer.current

# ReportTitle is the text to display as the title.  The hostname
# (unless blank) is appended to the end of this string (separated with
# a space) to generate the final full title string.
# Default is (for english) "Usage Statistics for".

ReportTitle    Usage Statistics for

# HostName defines the hostname for the report.  This is used in
# the title, and is prepended to the URL table items.  This allows
# clicking on URL's in the report to go to the proper location in
# the event you are running the report on a 'virtual' web server,
# or for a server different than the one the report resides on.
# If not specified here, or on the command line, webalizer will
# try to get the hostname via a uname system call.  If that fails,
# it will default to "localhost".

HostName       YOUR_HOSTNAME.com

# PageType lets you tell the Webalizer what types of URL's you
# consider a 'page'.  Most people consider html and cgi documents
# as pages, while not images and audio files.  If no types are
# specified, defaults will be used ('htm*', 'cgi' and HTMLExtension
# if different for web logs, 'txt' for ftp logs).

PageType        htm*

# DNSCache specifies the DNS cache filename to use for reverse DNS lookups.
# This file must be specified if you wish to perform name lookups on any IP
# addresses found in the log file.  If an absolute path is not given as
# part of the filename (ie: starts with a leading '/'), then the name is
# relative to the default output directory.  See the DNS.README file for
# additional information.

DNSCache        dns_cache.db

# DNSChildren allows you to specify how many "children" processes are
# run to perform DNS lookups to create or update the DNS cache file.
# If a number is specified, the DNS cache file will be created/updated
# each time the Webalizer is run, immediately prior to normal processing,
# by running the specified number of "children" processes to perform
# DNS lookups.  If used, the DNS cache filename MUST be specified as
# well.  The default value is zero (0), which disables DNS cache file
# creation/updates at run time.  The number of children processes to
# run may be anywhere from 1 to 100, however a large number may affect
# normal system operations.  Reasonable values should be between 5 and
# 20.  See the DNS.README file for additional information.

DNSChildren     20

# HTMLPre defines HTML code to insert at the very beginning of the
# file.  Default is the DOCTYPE line shown below.  Max line length
# is 80 characters, so use multiple HTMLPre lines if you need more.

HTMLPre <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

# HTMLHead defines HTML code to insert within the <HEAD></HEAD>
# block, immediately after the <TITLE> line.  Maximum line length
# is 80 characters, so use multiple lines if needed.

HTMLHead <META NAME="calomel.org" CONTENT="The Webalizer">

# HTMLBody defines the HTML code to be inserted, starting with the
# <BODY> tag.  If not specified, the default is shown below.  If
# used, you MUST include your own <BODY> tag as the first line.
# Maximum line length is 80 char, use multiple lines if needed.

HTMLBody <BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">

# The Quiet option suppresses output messages... Useful when run
# as a cron job to prevent bogus e-mails.  Values can be either
# "yes" or "no".  Default is "no".  Note: this does not suppress
# warnings and errors (which are printed to stderr).

Quiet           no

# TimeMe allows you to force the display of timing information
# at the end of processing.  A value of 'yes' will force the
# timing information to be displayed.  A value of 'no' has no
# effect.

TimeMe          yes

# GMTTime allows reports to show GMT (UTC) time instead of local
# time.  Default is to display the time the report was generated
# in the timezone of the local machine, such as EDT or PST.  This
# keyword allows you to have times displayed in UTC instead.  Use
# only if you really have a good reason, since it will probably
# screw up the reporting periods by however many hours your local
# time zone is off of GMT.

GMTTime         no

# VisitTimeout allows you to set the default timeout for a visit
# (sometimes called a 'session').  The default is 30 minutes,
# which should be fine for most sites.
# Visits are determined by looking at the time of the current
# request, and the time of the last request from the site.  If
# the time difference is greater than the VisitTimeout value, it
# is considered a new visit, and visit totals are incremented.
# Value is the number of seconds to timeout (default=1800=30min)

VisitTimeout    1800

# CountryGraph allows the usage by country graph to be disabled.
# Values can be 'yes' or 'no', default is 'yes'.

CountryGraph    yes

# DailyGraph and DailyStats allow the daily statistics graph
# and statistics table to be disabled (not displayed).  Values
# may be "yes" or "no". Default is "yes".

DailyGraph      yes
DailyStats      yes

# HourlyGraph and HourlyStats allow the hourly statistics graph
# and statistics table to be disabled (not displayed).  Values
# may be "yes" or "no". Default is "yes".

HourlyGraph     yes
HourlyStats     yes

# GraphLegend allows the color coded legends to be turned on or off
# in the graphs.  The default is for them to be displayed.  This only
# toggles the color coded legends, the other legends are not changed.
# If you think they are hideous and ugly, say 'no' here :)

GraphLegend     yes

# GraphLines allows you to have index lines drawn behind the graphs.
# I personally am not crazy about them, but a lot of people requested
# them and they weren't a big deal to add.  The number represents the
# number of lines you want displayed.  Default is 2, you can disable
# the lines by using a value of zero ('0').  [max is 20]
# Note, due to rounding errors, some values don't work quite right.
# The lower the better, with 1,2,3,4,6 and 10 producing nice results.

GraphLines      2

# The "Top" options below define the number of entries for each table.
# Defaults are Sites=30, URL's=30, Referrers=30 and Agents=15, and
# Countries=30. TopKSites and TopKURLs (by KByte tables) both default
# to 10, as do the top entry/exit tables (TopEntry/TopExit).  The top
# search strings and usernames default to 20.  Tables may be disabled
# by using zero (0) for the value.

TopSites        10
TopKSites       10
TopURLs         10
TopKURLs        10
TopReferrers    10
TopAgents       10
TopCountries    10
TopEntry        10
TopExit         10
TopSearch       50
TopUsers        10

# The Hide*, Group* and Ignore* and Include* keywords allow you to
# change the way Sites, URL's, Referrers, User Agents and Usernames
# are manipulated.  The Ignore* keywords will cause The Webalizer to
# completely ignore records as if they didn't exist (and thus not
# counted in the main site totals).  The Hide* keywords will prevent
# things from being displayed in the 'Top' tables, but will still be
# counted in the main totals.  The Group* keywords allow grouping
# similar objects as if they were one.  Grouped records are displayed
# in the 'Top' tables and can optionally be displayed in BOLD and/or
# shaded. Groups cannot be hidden, and are not counted in the main
# totals. The Group* options do not, by default, hide all the items
# that it matches.  If you want to hide the records that match (so just
# the grouping record is displayed), follow with an identical Hide*
# keyword with the same value.  (see example below)  In addition,
# Group* keywords may have an optional label which will be displayed
# instead of the keyword's value.  The label should be separated from
# the value by at least one 'white-space' character, such as a space
# or tab.
#
# The value can have either a leading or trailing '*' wildcard
# character.  If no wildcard is found, a match can occur anywhere
# in the string. Given a string "www.yourmama.com", the values "your",
# "*mama.com" and "www.your*" will all match.

# Your own site should be hidden
HideSite        *YOUR_HOSTNAME.com
HideSite        localhost

# Your own site gives most referrals
HideReferrer    localhost
HideReferrer    YOUR_HOSTNAME.com/

# This one hides non-referrers ("-" Direct requests)
#HideReferrer   Direct Request

# Usually you want to hide these
HideURL         *.gif
HideURL         *.GIF
HideURL         *.jpg
HideURL         *.JPG
HideURL         *.png
HideURL         *.PNG

# The MangleAgents keyword allows you to specify how much, if any, The Webalizer
# should mangle user agent names.  This allows several levels of detail
# to be produced when reporting user agent statistics.  There are six
# levels that can be specified, which define different levels of detail
# suppression.  Level 5 shows only the browser name (MSIE or Mozilla)
# and the major version number.  Level 4 adds the minor version number
# (single decimal place).  Level 3 displays the minor version to two
# decimal places.  Level 2 will add any sub-level designation (such
# as Mozilla/3.01Gold or MSIE 3.0b).  Level 1 will attempt to also add
# the system type if it is specified.  The default Level 0 displays the
# full user agent field without modification and produces the greatest
# amount of detail.  User agent names that can't be mangled will be
# left unmodified.
#
## DO NOT USE MangleAgents=1 or webalizer might SEG FAULT.

MangleAgents    5

# The SearchEngine keywords allow specification of search engines and
# their query strings on the URL.  These are used to locate and report
# what search strings are used to find your site.  The first word is
# a substring to match in the referrer field that identifies the search
# engine, and the second is the URL variable used by that search engine
# to define its search terms.

SearchEngine    yahoo.com       p=
SearchEngine    altavista.com   q=
SearchEngine    google.com      q=
SearchEngine    eureka.com      q=
SearchEngine    lycos.com       query=
SearchEngine    hotbot.com      MT=
SearchEngine    msn.com         MT=
SearchEngine    infoseek.com    qt=
SearchEngine    webcrawler      searchText=
SearchEngine    excite          search=
SearchEngine    netscape.com    search=
SearchEngine    mamma.com       query=
SearchEngine    alltheweb.com   query=
SearchEngine    northernlight.com  qr=

#######################################################
###  Calomel.org  webalizer.conf   END
#######################################################
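
The wildcard rule described in the Hide*/Group*/Ignore* comments can be sketched in a few lines of Python. This is only our reading of the comment text (a leading '*', a trailing '*', or a plain substring), not Webalizer's actual source code. The function name webalizer_match is hypothetical, and details such as case handling are not modeled.

```python
def webalizer_match(pattern, value):
    """Match rule as described in the config comments (a sketch only)."""
    if pattern.startswith("*"):
        # Leading wildcard: the rest of the pattern must end the string.
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        # Trailing wildcard: the rest of the pattern must start the string.
        return value.startswith(pattern[:-1])
    # No wildcard: the pattern may occur anywhere in the string.
    return pattern in value


host = "www.yourmama.com"
for pattern in ("your", "*mama.com", "www.your*", "*your"):
    print(pattern, webalizer_match(pattern, host))
```

The first three patterns are the ones from the comment and all match. The last one, "*your", does not, because under this reading a leading wildcard anchors the rest of the pattern to the end of the string.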







Starting the install

Step 1: Install Webalizer v2.x or greater from a package or from source. For the example we are using the package from OpenBSD, which is v2.0.

Step 2: Place the webalizer.conf file from above at /etc/webalizer.conf. You may want to back up the default conf file the package places there so you can refer to it later.

Take some time and look through the config file and familiarize yourself with the options. We tried to make sure all of the options used were fully documented to make it easier for future reference.

Step 3: Webalizer will save the report html pages in the web directory of your choice specified by "OutputDir". For our example we placed all of the files in "/var/www/htdocs/webalizer" so our web server can access them. Make sure your directory exists before executing Webalizer for the first time.
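
If you want to script the "make sure your directory exists" check, something like the following Python sketch could work. It assumes the simple "keyword value" line layout seen in the config above and takes the first OutputDir it finds. The helper names read_directive and ensure_output_dir are our own, and the demo writes into a temporary directory rather than /var/www/htdocs/webalizer.

```python
import os
import tempfile


def read_directive(conf_text, keyword):
    """Return the first value given for keyword, or None if it is not set."""
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == keyword:
            return parts[1].strip()
    return None


def ensure_output_dir(conf_text):
    """Create the OutputDir named in the config text if it is missing."""
    output_dir = read_directive(conf_text, "OutputDir")
    if output_dir is None:
        raise ValueError("OutputDir is not set")
    os.makedirs(output_dir, exist_ok=True)
    return output_dir


# Demo against a throwaway directory instead of the real web root.
demo_root = tempfile.mkdtemp()
conf = "# example\nOutputDir      " + os.path.join(demo_root, "webalizer") + "\n"
created = ensure_output_dir(conf)
print(created, os.path.isdir(created))
```

For a real install you would read the text from /etc/webalizer.conf and run the script as a user allowed to create the directory.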





Executing Webalizer

Step 4: Webalizer generates new html report pages when executed. The best way to run it is from a cron job. In this example we will have Webalizer run once every 4 hours on the 10th minute.

This is an example of the crontab you can use.

#minute (0-59)
#|   hour (0-23)
#|   |    day of the month (1-31)
#|   |    |   month of the year (1-12)
#|   |    |   |   day of the week (0-6 with 0=Sun)
#|   |    |   |   |   commands
#|   |    |   |   |   |
#### Webalizer - Generate Web Report
10   */4  *    *   *   /usr/local/bin/webalizer >> /dev/null 2>&1 
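
To see when the crontab line above actually fires, the minute and hour fields can be expanded by hand. The Python sketch below only understands the three field forms used here (a number, "*", and "*/step"); it is an illustration, not a full cron parser, and expand_cron_field is a hypothetical helper.

```python
def expand_cron_field(field, low, high):
    """Expand a cron field limited to the forms N, * and */step."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


minutes = expand_cron_field("10", 0, 59)
hours = expand_cron_field("*/4", 0, 23)
run_times = ["%02d:%02d" % (h, m) for h in hours for m in minutes]
print(run_times)
```

This prints ['00:10', '04:10', '08:10', '12:10', '16:10', '20:10'], so Webalizer runs six times a day.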





In Conclusion

Now that Webalizer is active you should be able to point your web browser at the http://YOUR_HOSTNAME.com/webalizer directory on your web server. If you have any problems, or you notice your reports are incomplete, check the following Q&A to see if your question is answered, or go through the above instructions again in case you missed a step.





Questions?

Why does webalizer Segmentation Fault?

The problem occurs only when the MangleAgents level is 1 and some of the newer browsers that have multiple parentheses ('(' and ')') in their user agent name access your site. The workaround is to use any value other than 1 for MangleAgents, as in the example above.

This is an example of the output of webalizer when it Segmentation Faults:

Webalizer V2.01-10 (OpenBSD 4.2) English
Using logfile /var/log/httpd/access.log (clf)
Creating output in /var/www/httpd/webalizer/
Hostname for reports is 'calomel.org'
History file not found...
Previous run data not found...
[new_snode] Warning: String exceeds storage size (76)
[new_snode] Warning: String exceeds storage size (76)
[new_snode] Warning: String exceeds storage size (72)
[new_snode] Warning: String exceeds storage size (77)
[new_snode] Warning: String exceeds storage size (89)
[new_snode] Warning: String exceeds storage size (79)
[new_snode] Warning: String exceeds storage size (81)
Segmentation Fault (core dump) (webalizer.core)

## If webalizer did not SegFault the output would have been...
Saving current run data... [10/10/2010 10:10:10]
Generating report for January 2010
Generating summary report
Saving history information...
941323 records in 0.96 seconds

Why don't Referrers or User Agents show up?

In order for the Webalizer to produce statistics for user agents (browsers) and referrers, that information needs to be in the log files produced by the web server. Most servers by default only produce CLF logs, which do not include the extra information. The way you have your server include this information depends on what server you are running.
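
The difference between the two log formats is easy to see by parsing one line of each. The Python sketch below uses a regular expression of our own (not anything taken from Webalizer) and made-up log lines; it treats the referrer and user agent as two optional quoted fields at the end of a CLF line and does not handle escaped quotes inside them.

```python
import re

# Common Log Format fields, optionally followed by the two quoted fields
# that the Combined format adds (referrer and user agent).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)

# Made-up example lines; the addresses and URLs are placeholders.
clf_line = ('192.0.2.1 - - [10/Oct/2010:10:10:10 +0000] '
            '"GET /index.html HTTP/1.1" 200 2326')
combined_line = (clf_line +
                 ' "http://example.com/start.html" "Mozilla/5.0 (X11; OpenBSD)"')

for line in (clf_line, combined_line):
    fields = LOG_LINE.match(line).groupdict()
    print(fields["referrer"], "|", fields["agent"])
```

For the plain CLF line both fields come back as None, which is why the Referrers and User Agents tables stay empty until the server is told to log the extra fields.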

Why does the country section show 100% unresolved?

Make sure you are using Webalizer 2.x or higher and that the directives "DNSCache dns_cache.db" and "DNSChildren 20" are enabled as in the Calomel.org webalizer.conf example. Otherwise Webalizer will not be able to do DNS resolution.

What is the difference between 'HITS' and 'FILES'?

Basically, HITS is the total number of HTTP requests the server received during the reporting period. Any request made to the server is considered a hit. FILES is the number of hits that actually resulted in something being sent back to the user, such as an HTML page or image. 'Total Files' and '200 - OK' totals should be the same. If you add up the totals in the 'Hits by Response Code' section, it should be the same as the 'Total Hits' figure.
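
As a small worked example of the relationship described above, the Python sketch below tallies a handful of invented response codes. It simply follows the statement that 'Total Files' should equal the '200 - OK' count; the status list is made up for illustration.

```python
from collections import Counter

# Hypothetical status codes, one per logged request.
statuses = [200, 200, 304, 404, 200, 301, 200, 304]

hits_by_code = Counter(statuses)
total_hits = len(statuses)        # every request counts as a hit
total_files = hits_by_code[200]   # per the text, files match '200 - OK'

print(total_hits, total_files)
print(sum(hits_by_code.values()) == total_hits)
```

This prints "8 4" and then "True": eight hits, four files, and the per-code totals add back up to the hit count.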

Why do my reports show more Sites than Visits?

Visits are only triggered when a valid request is found for a page, as defined by your PageType setting (or a URL that ends with a slash, which is also considered a page type). Sites however, are counted regardless of the request type. It is very common to have more sites than visits, particularly if you host non-pagetype URLs on your site that are linked to from the outside. If you are not hosting URLs that are linked to from outside sites, then make sure your PageType setting is correct. The default is .htm, .html and .cgi extensions, unless you specify otherwise.
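
The Python sketch below shows how a log can end up with more sites than visits. It is a simplified model built from the description above and the VisitTimeout comment in the config, not Webalizer's real algorithm. We assume that only page requests start or extend a visit, that the default page types are the .htm, .html and .cgi extensions plus URLs ending in a slash, and that records arrive in time order. The records themselves are invented.

```python
VISIT_TIMEOUT = 1800  # seconds, as in the VisitTimeout directive
PAGE_EXTENSIONS = (".htm", ".html", ".cgi")


def is_page(url):
    """A URL counts as a page if it ends in a slash or a page extension."""
    return url.endswith("/") or url.endswith(PAGE_EXTENSIONS)


def count_sites_and_visits(records):
    """records: iterable of (host, unix_time, url) in time order."""
    sites = set()
    last_page_time = {}
    visits = 0
    for host, when, url in records:
        sites.add(host)  # sites are counted for any request
        if not is_page(url):
            continue     # non-page requests never trigger a visit
        previous = last_page_time.get(host)
        if previous is None or when - previous > VISIT_TIMEOUT:
            visits += 1
        last_page_time[host] = when
    return len(sites), visits


records = [
    ("192.0.2.1", 0, "/index.html"),
    ("192.0.2.2", 100, "/images/logo.png"),   # image linked from outside
    ("192.0.2.3", 200, "/images/logo.png"),
    ("192.0.2.4", 300, "/files/tool.tar.gz"),
    ("192.0.2.1", 600, "/about.html"),        # within 1800s: same visit
    ("192.0.2.1", 4000, "/"),                 # 3400s later: new visit
]
print(count_sites_and_visits(records))
```

This prints (4, 2): four distinct sites made requests, but only one of them ever asked for a page, and it did so in two separate visits.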

Can I run The Webalizer on partial logs if my logs are really big?

Yes. You need to enable incremental processing. This allows you to rotate your logs as often as needed without the loss of statistical detail between runs. Use the "Incremental" keyword in your configuration file, or the "-p" command line switch to enable incremental processing.

Why do I get "Error adding xxx node, skipping" errors?

You ran out of memory for the size of your log data set. The error occurs when a malloc call is made to allocate free memory, and fails. You can increase your swap space, but the disk thrashing will be horrible; the only real solution is to add more physical memory.





Questions, comments, or suggestions? Contact Calomel.org