First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges

KRAGELJ, Matjaž and KOVAČIČ, Mitja (2015) First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges. Paper presented at: IFLA WLIC 2015 - Cape Town, South Africa in Session 90 - Preservation and Conservation with Information Technology.

Bookmark or cite this item: https://library.ifla.org/id/eprint/1191

PDF (998kB)

Language: English (Original)

Available under licence Creative Commons Attribution.

Bookmark or cite this item: https://library.ifla.org/id/eprint/1191/1/090-kragelj-en.pdf

Abstract

The National and University Library (NUK) has been archiving the web for almost fifteen years. Over the last six years, we have worked at several levels of harvesting. For most of that time, we have harvested selected web sites that might be significant for future generations. The harvesting process runs smoothly, apart from some technical difficulties caused by scripting languages and dynamic content (for instance Ajax, Flash, JavaScript, asynchronous transmissions, real-time streaming protocols, etc.). The number of archived web pages is growing very fast. We are also very successful in harvesting social media web sites with tools developed at NUK. Aware that the Slovenian web is far more extensive than the part we had harvested, we decided to start harvesting the Slovenian domain (*.si). The first domain harvest was successful; however, we realized that much deeper and broader levels should be harvested using heuristic methods. Our experience showed that the most informative web content is hidden beneath the *.si domain data provided by ARNES (Academic and Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. Further, the results of the first and second harvesting iterations are compared and analysed on a sample of the first hundred domains. Finally, the relevance of the data acquired from the harvested web pages as a complementary data source for the digital library is presented.

Item Type:

Conference or Workshop Item (Paper)

Conference details:

IFLA WLIC 2015 - Cape Town, South Africa

Session 90 - 10 years of development to collect preserve and access Web-Sites: Ready to go for everyone!? - Preservation and Conservation with Information Technology

Monday 17 August 2015 09:30 - 12:45 | Room: Ballroom West

Divisions:

Division 2 Library Collections > Preservation and Conservation Section

Authors:

Name	Affiliation	Country
KRAGELJ, Matjaž	Digital Library Development Department, Information Technology and Digital Library Division, National and University Library, Ljubljana	Slovenia
KOVAČIČ, Mitja	Digital Library Development Department, Information Technology and Digital Library Division, National and University Library, Ljubljana	Slovenia

Uncontrolled Keywords:

Web archiving, harvesting, national domain, social networks harvesting, digital library

Date Deposited:

01 Jul 2015 12:48

Last Modified:

14 Aug 2017 08:56

URI:

https://library.ifla.org/id/eprint/1191
