Using Named Entity Recognition for Automatic Indexing

GOH, Rachael (2018) Using Named Entity Recognition for Automatic Indexing. Paper presented at: IFLA WLIC 2018 – Kuala Lumpur, Malaysia – Transform Libraries, Transform Societies in Session 115 - Subject Analysis and Access.

Bookmark or cite this item: http://library.ifla.org/id/eprint/2214
Language: English (Original)
Available under licence Creative Commons Attribution.
Bookmark or cite this item: http://library.ifla.org/2214/1/115-goh-en.pdf

Abstract

Libraries have used automatic indexing for their collections for many years. Recent technological advances have refined automatic indexing, making it quicker and more accurate. In 2016, the National Library Board (NLB) embarked on a Named Entity Recognition (NER) project that used natural language processing techniques to extract the names of entities such as people, organisations and places from documents such as articles. Articles from NLB’s Infopedia (eresources.nlb.gov.sg/infopedia) and HistorySG (eresources.nlb.gov.sg/history) were identified for extraction. The extracted entities have since enabled users to search for related resources in NLB’s vast collections. In late 2017, the NER project was extended to extract topics mentioned in articles and metadata records. This automated indexing was done using NLB’s controlled vocabularies such as Events, Historical Events, Programmes, Legal Acts, Awards, Time, and SingHeritage. Data from the collections of other agencies, the National Archives of Singapore (NAS) and the National Heritage Board (NHB), were also included. A smaller part of the project involves running the documents against external vocabularies such as GeoNames and Wikidata. This paper will discuss the process, the method used to evaluate the accuracy of the extracted named entities, the NER issues discovered across the collections, and the challenges faced in ensuring that the extraction process improves after verification. Lastly, the paper will discuss how the NER results are used to support NLB’s digital cataloguing, for example by adding the extracted entities to the subject field of metadata records, without manual cataloguing, for collections with minimal or no subject headings.
This allows the collections to be searched by subject in NLB’s OneSearch platform (search.nlb.gov.sg), where metadata from three cultural institutions (library, archive and heritage) is aggregated behind an integrated interface to enable a single search. The NER results are also used to support NLB’s Linked Data services, such as the Linked Data widget, which identifies entities in website articles and links them to relevant resources from other collections. In this way, we supplement manual indexing by creating additional access points to the articles, while at the same time creating links to our catalogue records. Already implemented on NLB’s Infopedia and HistorySG websites, this service is being explored for future implementation on other websites such as NAS’ Archives Online (nas.gov.sg) and NHB’s RootsSG website (roots.sg).
