First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges
KRAGELJ, Matjaž and KOVAČIČ, Mitja (2015) First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges. Paper presented at: IFLA WLIC 2015 - Cape Town, South Africa in Session 90 - Preservation and Conservation with Information Technology.
The National and University Library (NUK) has been archiving the web for almost fifteen years. Over the last six years, we have worked at several levels of harvesting. For most of that time, we have harvested selected web sites that might be significant for future generations. The harvesting process runs smoothly, apart from some technical difficulties caused by scripting and dynamic web technologies (for instance Ajax, Flash, JavaScript, asynchronous transmissions and real-time streaming protocols). The number of archived web pages is growing very fast. We have also been very successful in harvesting social media web sites with tools developed at NUK. Aware that the Slovenian web is far more extensive than the part we had harvested, we decided to start harvesting the Slovenian domain (*.si). The first domain harvest was successful; however, we realized that much deeper and broader levels should be harvested using heuristic methods. Our experience showed that the most informative web content is hidden beneath the *.si domain data provided by ARNES (Academic Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. Further, on a sample of the first hundred domains, the results of the first and second harvesting iterations are compared and analysed. Finally, the relevance of the data acquired from the harvested web pages as a complementary data source for a digital library is presented.
|Item Type:||Conference or Workshop Item (Paper)|
|Conference details:||IFLA WLIC 2015 - Cape Town, South Africa, Session 90 - 10 years of development to collect preserve and access Web-Sites: Ready to go for everyone!? - Preservation and Conservation with Information Technology|
|Divisions:||Division 2 Library Collections > Preservation and Conservation Section|
|Uncontrolled Keywords:||Web archiving, harvesting, national domain, social networks harvesting, digital library|
|Date Deposited:||01 Jul 2015 12:48|
|Last Modified:||01 Jul 2015 12:48|