The Refcat dataset is released under a CC0 license and is available for download from the Internet Archive. The software for the extraction and matching process, including exact and fuzzy citation matching (refcat and fuzzycat), is also released as open source. For those interested in technical details about the project, a white paper authored by IA engineers, including Martin Czygan, who led work on Refcat, is available on arxiv.org, and the project is also described in our catalog user guide.
What does Refcat mean for regular users of IA Scholar? Refcat results from work to ensure the interconnection between material within IA Scholar and other resources archived in the Internet Archive, in order to make browsing and lookups easier and to ensure overall citation integrity and persistence. For example, there are over 25 million web links in the citations in Refcat. We were able to match ~14 million of these to archived web pages in the Wayback Machine, and found that ~18% of these matched web citations are no longer available on the live web.
Web links in citations that are not in the Wayback Machine have been added to ongoing web harvests. We also matched over 20 million citations to books that are available for lending in our Open Library service, and over 1 million citations to Wikipedia entries.
Besides interconnection, Refcat will allow users to see which works have cited a specific scholarly resource (i.e. “cited by” or “inbound citations”), which will support improved discovery features. Finally, knowing the full “knowledge graph” of IA Scholar helps us better identify important scholarly material that we have not yet archived, thus improving the overall quality and extent of the collection.