First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges

KRAGELJ, Matjaž and KOVAČIČ, Mitja (2015) First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges. Paper presented at: IFLA WLIC 2015 - Cape Town, South Africa in Session 90 - Preservation and Conservation with Information Technology.

Bookmark or cite this item: http://library.ifla.org/id/eprint/1191
Language: English (Original)
Available under licence Creative Commons Attribution.

Abstract

The National and University Library (NUK) has been archiving the web for almost fifteen years. Over the last six years, we have worked at several levels of harvesting. For most of that time, we have harvested selected web sites that might be significant for future generations. The harvesting process runs smoothly, apart from some technical difficulties caused by scripted and dynamic content (for instance Ajax, Flash, JavaScript, asynchronous transmissions, real-time streaming protocols, etc.). The number of archived web pages is growing very fast. We have also been very successful in harvesting social media web sites with tools developed at NUK. Aware that the number of existing web pages is far greater than the number we had harvested, we decided to start harvesting the Slovenian domain (*.si). The first domain harvest was successful; however, we realized that much deeper and broader levels should be harvested using heuristic methods. Our experience showed that the most informative web content is hidden beneath the *.si domain data provided by ARNES (Academic and Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. It then compares and analyses the results of the first and second harvesting iterations on a sample of the first hundred domains. Finally, it discusses the relevance of the data acquired from the harvested web pages as a complementary data source for a digital library.
