IBM Research, Tokyo Research Laboratory (TRL)

Web site visualization by combining a Web sitemap with access statistics charts


Outline

Understanding the trends and distribution of accesses to a Web site is very important, especially for its designers, administrators, and analysts. Many existing commercial Web access analysis tools simply present access statistics as bar charts, pie charts, ranking tables, and so on; however, it can be difficult to discover local, latent, and interesting access trends using only such simple representations.

To address this problem, we developed a Web site visualization system that provides an overview of access distributions. The system automatically generates a sitemap with the Data Jewelry Box algorithm, and represents the access distribution by combining the sitemap with an access distribution chart.


Research contents

The following figure shows an overview of the system. The system first reads the access log files, which are usually stored by Web servers. It then collects the URLs of the accessed files and constructs hierarchical data according to the hierarchy of URL directories (see the lower-left part of the figure). From this hierarchy it automatically generates the sitemap with the Data Jewelry Box algorithm.
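The log-reading and hierarchy-building steps can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes the logs are in Common Log Format (the page does not say which format is used), and the names `LOG_PATTERN`, `collect_paths`, and `build_hierarchy` are hypothetical. The sitemap layout itself is not shown.

```python
import re
from urllib.parse import urlsplit

# Assumption: Common Log Format; the original page does not name the log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def collect_paths(lines):
    """Return the URL path of every log line that matches LOG_PATTERN."""
    paths = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            # Drop the query string so that each file is counted once.
            paths.append(urlsplit(match.group("url")).path)
    return paths

def build_hierarchy(paths):
    """Nest paths into dictionaries keyed by directory or file name."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            if part:
                node = node.setdefault(part, {})
    return root

# Made-up sample lines for illustration only.
sample_log = [
    '192.0.2.1 - - [28/Jun/2002:10:15:00 +0900] "GET /products/a/index.html HTTP/1.0" 200 1043',
    '192.0.2.2 - - [28/Jun/2002:10:16:30 +0900] "GET /products/b/index.html HTTP/1.0" 200 2210',
    '192.0.2.1 - - [28/Jun/2002:11:02:12 +0900] "GET /docs/api/list.html?x=1 HTTP/1.0" 304 -',
]
hierarchy = build_hierarchy(collect_paths(sample_log))
print(hierarchy)
```

Each nested dictionary corresponds to one directory, so a layout algorithm can draw one enclosing rectangle per dictionary and one dot per leaf.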

At the same time, the system aggregates the accesses by user-specified attributes, such as date, time of day, return code, referrer, and client host. It then represents the statistics as access distribution charts (see the upper-right part of the figure). When the user clicks a part of an access distribution chart, the system maps the statistics of the accesses corresponding to the clicked part onto the sitemap. Finally, it represents the access distribution as three-dimensional bar charts on the sitemap (see the lower-right part of the figure).
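The aggregation and the click-to-sitemap mapping described above might look like the sketch below. The record fields and the function names `aggregate` and `accesses_per_page` are assumptions made for illustration; the page does not describe the system's internal data model.

```python
from collections import Counter

# Hypothetical, already-parsed access records; the field names are assumptions.
records = [
    {"url": "/index.html", "date": "2002-06-07", "hour": 9, "status": 200,
     "referrer": "-"},
    {"url": "/index.html", "date": "2002-06-07", "hour": 9, "status": 200,
     "referrer": "http://news.example/"},
    {"url": "/docs/a.pdf", "date": "2002-06-07", "hour": 14, "status": 206,
     "referrer": "-"},
    {"url": "/docs/a.pdf", "date": "2002-06-08", "hour": 9, "status": 200,
     "referrer": "-"},
]

def aggregate(records, attribute):
    """Count accesses per value of the chosen attribute (one chart bar each)."""
    return Counter(record[attribute] for record in records)

def accesses_per_page(records, attribute, value):
    """Count accesses per URL among records matching the clicked chart part."""
    return Counter(
        record["url"] for record in records if record[attribute] == value
    )

chart = aggregate(records, "hour")
on_sitemap = accesses_per_page(records, "hour", 9)
print(chart)
print(on_sitemap)
```

Here `chart` plays the role of the access distribution chart, and `on_sitemap` holds the per-page counts that would become bar heights after a click on the bar for hour 9.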

The following two images are examples of access distribution charts produced by the system. The left image is generated by counting the accesses per date, and then dividing and coloring each bar into 24 parts according to the hour of the day. When the user clicks a part of the chart, the system represents the access distribution for the hour corresponding to the clicked part. The right image is generated by counting the accesses per return code, and then dividing and coloring each bar into 24 parts according to the hour of the day.
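The data behind both charts is a two-level count: one bar per date (or per return code), split into 24 hourly segments. The sketch below shows one way to compute it; the function name `stacked_counts` and the tuple layout of `accesses` are assumptions, and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime

def stacked_counts(accesses, bar_key):
    """Map each bar key to a list of 24 hourly access counts."""
    bars = defaultdict(lambda: [0] * 24)
    for timestamp, status in accesses:
        bars[bar_key(timestamp, status)][timestamp.hour] += 1
    return dict(bars)

# Made-up (timestamp, return code) pairs for illustration only.
accesses = [
    (datetime(2002, 6, 7, 9, 5), 200),
    (datetime(2002, 6, 7, 9, 40), 404),
    (datetime(2002, 6, 7, 14, 0), 200),
    (datetime(2002, 6, 8, 9, 0), 206),
]

# Left chart: one bar per date. Right chart: one bar per return code.
by_date = stacked_counts(accesses, lambda ts, status: ts.date())
by_status = stacked_counts(accesses, lambda ts, status: status)
print(by_date)
print(by_status)
```

Clicking a segment then corresponds to picking one bar key and one hour index from these tables.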

The following image is an example of an access distribution represented by the system, obtained by clicking a red part of the seventh day in the left image above. On the sitemap, gray dots denote the webpages that were not accessed in that hour. Red bars denote the webpages that were accessed in that hour, and their heights denote the number of accesses in that hour. The rectangular border lines surrounding the gray dots and red bars denote the directory hierarchy of the Web site.

The green circle in the upper-right part of the above figure indicates that more than one hundred webpages in one directory were accessed in that hour. We found that all of these webpages belong to the API reference of open source code. It is sometimes difficult to discover such latent accesses by eager browsers; however, this example shows that the system makes such accesses easy to discover.
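The system reveals this pattern visually. As a rough programmatic analogue, and not something the page says the system does, the same situation could be flagged by counting the distinct pages accessed per directory and hour. The function name `busy_directories`, the string hour keys, and the default threshold of 100 are assumptions chosen to mirror the example above.

```python
from collections import defaultdict
import posixpath

def busy_directories(accesses, threshold=100):
    """Return {(hour, directory): page count} where more than `threshold`
    distinct pages of one directory were accessed within one hour."""
    pages = defaultdict(set)
    for hour_key, path in accesses:
        pages[(hour_key, posixpath.dirname(path))].add(path)
    return {
        key: len(urls) for key, urls in pages.items() if len(urls) > threshold
    }

# Made-up data: 150 API reference pages accessed within a single hour.
accesses = [("2002-06-07 14", f"/api/Class{i}.html") for i in range(150)]
accesses += [
    ("2002-06-07 14", "/index.html"),
    ("2002-06-07 15", "/api/Class0.html"),
]
busy = busy_directories(accesses)
print(busy)
```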

The other green circle, in the lower-right part of the above figure, indicates that one webpage was accessed very frequently in that hour. The number of accesses to this webpage was high only on that day. When the user clicks the bar, the system provides an access statistics chart for that specific webpage. The following image shows the access distribution chart generated by aggregating the accesses to the webpage by referrer. From this chart, we found that the webpage was frequently reached through a link on the website of a business newspaper company.

We also found the following access trends of Web sites using the system:

  • Webpages of some software products for servers are frequently accessed only in the daytime on weekdays; on the other hand, webpages of some PC software products are constantly accessed on weekdays and weekends.
  • Many accesses to PDF files that are linked without stating their file sizes have return code 206 (Partial Content), which suggests that these accesses are terminated before completion.
  • Webpages which are introduced by business Web sites are mainly accessed in the morning on weekdays.
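A finding such as the one about PDF files and return code 206 can be checked numerically once it has been spotted. The following sketch computes the share of a given return code among accesses to files with a given extension; the function name `status_share` and the sample data are hypothetical.

```python
import posixpath

def status_share(accesses, extension, status):
    """Fraction of accesses to files with `extension` that returned `status`."""
    matching = [
        code
        for path, code in accesses
        if posixpath.splitext(path)[1].lower() == extension
    ]
    if not matching:
        return 0.0
    return matching.count(status) / len(matching)

# Made-up (URL path, return code) pairs for illustration only.
accesses = [
    ("/docs/a.pdf", 206),
    ("/docs/a.pdf", 200),
    ("/docs/b.PDF", 206),
    ("/index.html", 200),
]
pdf_partial_share = status_share(accesses, ".pdf", 206)
print(pdf_partial_share)
```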

We are currently implementing a prototype of the visualization system with the following architecture. On receiving a visualization request from a client, the Web server calls the visualization engine, generates sitemaps, aggregates accesses, and generates SVG image files as the results. It then returns the URLs of the image files to the client. Finally, the client opens a Web browser and accesses the SVG image files.
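The SVG generation step can be illustrated with a deliberately simplified sketch. It is not the prototype's renderer: it draws a flat, two-dimensional row of pages instead of three-dimensional bars on a sitemap, and the function name `render_svg` and its parameters are assumptions. Only the color convention (gray dots for pages without accesses, red bars for accessed pages) follows the description above.

```python
import xml.etree.ElementTree as ET

def render_svg(page_counts, cell=20, max_height=60):
    """Draw one column per page: a gray dot for zero accesses, else a red bar."""
    peak = max(page_counts.values(), default=0) or 1
    height = max_height + cell
    baseline = height - cell // 2
    svg = ET.Element(
        "svg",
        xmlns="http://www.w3.org/2000/svg",
        width=str(cell * len(page_counts)),
        height=str(height),
    )
    for index, (url, count) in enumerate(sorted(page_counts.items())):
        x = index * cell + cell // 2
        if count == 0:
            ET.SubElement(svg, "circle", cx=str(x), cy=str(baseline),
                          r="2", fill="gray")
        else:
            # Scale the bar so that the most accessed page has full height.
            bar = max_height * count / peak
            rect = ET.SubElement(svg, "rect", x=str(x - 4),
                                 y=f"{baseline - bar:.1f}", width="8",
                                 height=f"{bar:.1f}", fill="red")
            ET.SubElement(rect, "title").text = f"{url}: {count}"
    return ET.tostring(svg, encoding="unicode")

svg_text = render_svg({"/a.html": 0, "/b.html": 5, "/c.html": 10})
print(svg_text)
```

In an architecture like the one described, a server would write such a string to a file and return that file's URL to the client.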

Here is an example SVG file (632 KB) that represents a visualization result. It can be displayed directly in Web browsers that support SVG.



Last modified 28 Jun 2002