Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Extended Dataset)
- Attribution 3.0
- Academic Torrents
The Wikipedia links (WikiLinks) data consists of web pages that
satisfy the following two constraints:
a. contain at least one hyperlink that points to Wikipedia, and
b. the anchor text of that hyperlink closely matches the title of the target Wikipedia page.
We treat each page on Wikipedia as representing an entity
(or concept or idea), and the anchor text as a mention of the
entity. The WikiLinks data set was obtained by iterating
over Google's web index.
This dataset is accompanied by a tech report of the same name;
please cite that report if you use this data.
The dataset is divided into 10 gzipped text files,
data-0000[0-9]-of-00010.gz. Each file can be viewed
without uncompressing it by using zcat. For example:
```
zcat data-00001-of-00010.gz | head
MENTION	vacuum tube	421	http://en.wikipedia.org/wiki/Vacuum_tube
MENTION	vacuum tubes	10838	http://en.wikipedia.org/wiki/Vacuum_tube
MENTION	electron gun	598	http://en.wikipedia.org/wiki/Electron_gun
MENTION	fluorescent	790	http://en.wikipedia.org/wiki/Fluorescent
MENTION	oscilloscope	1307	http://en.wikipedia.org/wiki/Oscilloscope
MENTION	computer monitor	1503	http://en.wikipedia.org/wiki/Computer_monitor
MENTION	computer monitors	3066	http://en.wikipedia.org/wiki/Computer_monitor
MENTION	radar	1657	http://en.wikipedia.org/wiki/Radar
MENTION	plasma screens	2162	http://en.wikipedia.org/wiki/Plasma_screen
```
Each file is in the following format. Each web page is identified
by its URL (on a line annotated with "URL"). For every mention
(denoted by "MENTION"), we provide the actual mention string, the
byte offset of the mention from the start of the page, and the
target URL, all separated by tabs. It is possible (and in many
cases very likely) that the contents of a web page will change
over time. The dataset therefore also lists the 10 least frequent
tokens on each page at the time it was crawled. These lines start
with "TOKEN" and contain the token string and its byte offset from
the start of the page. The token strings can be used as fingerprints
to verify whether the page used to generate the data has changed.
Finally, pages are separated from each other by two blank lines.
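As a concrete illustration, the record layout described above can be parsed with a short Python sketch. The `Page` class, function names, and field names below are our own illustrative choices, not part of the dataset; we assume the fields on each line are tab-separated as described, and that pages are delimited by blank lines.

```python
import gzip
from dataclasses import dataclass, field

@dataclass
class Page:
    """One web page record: its URL, mentions, and token fingerprints."""
    url: str
    mentions: list = field(default_factory=list)  # (anchor_text, byte_offset, target_url)
    tokens: list = field(default_factory=list)    # (token_string, byte_offset)

def parse_lines(lines):
    """Yield Page records from an iterable of Wikilinks-format lines.

    A "URL" line starts a page; "MENTION" and "TOKEN" lines attach to
    the current page; blank lines end a page.
    """
    page = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Blank line: emit the page in progress, if any.
            if page is not None:
                yield page
                page = None
            continue
        fields = line.split("\t")
        tag = fields[0]
        if tag == "URL":
            page = Page(url=fields[1])
        elif tag == "MENTION" and page is not None:
            anchor, offset, target = fields[1], int(fields[2]), fields[3]
            page.mentions.append((anchor, offset, target))
        elif tag == "TOKEN" and page is not None:
            page.tokens.append((fields[1], int(fields[2])))
    if page is not None:
        yield page

def read_shard(path):
    """Stream pages from one gzipped shard, e.g. data-00001-of-00010.gz."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        yield from parse_lines(f)
```

Splitting on tabs (rather than whitespace) matters because mention strings such as "vacuum tube" contain spaces.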
#### Basic Statistics
Number of documents: 11 million
Number of entities: 3 million
Number of mentions: 40 million
Finally, please note that this dataset was created automatically
from the web and therefore contains some amount of noise.
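Because page contents drift over time, the TOKEN fingerprints can be used to detect a stale page before trusting its byte offsets. Below is a minimal check; it assumes you fetch the raw page bytes yourself, and our reading that offsets index into the page's byte stream with tokens encoded as UTF-8 is an assumption, not something the dataset guarantees.

```python
def page_unchanged(page_bytes, tokens):
    """Return True if every stored (token, byte_offset) pair still matches.

    `tokens` comes from the dataset's TOKEN lines; `page_bytes` is the
    raw content of the page as fetched today. Assumption: offsets are
    byte offsets into the page and tokens are UTF-8 encoded.
    """
    for token, offset in tokens:
        expected = token.encode("utf-8")
        if page_bytes[offset:offset + len(expected)] != expected:
            return False
    return True
```

A single mismatched fingerprint is enough to conclude the page has changed since the crawl.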
Amar Subramanya (firstname.lastname@example.org)
Sameer Singh (email@example.com)
Fernando Pereira (firstname.lastname@example.org)
Andrew McCallum (email@example.com)
Uploaded by arkiver2 on 2018-08-12 18:04:34 using the Internet Archive Python library 1.8.1.