COVID-19 Academic Graph

This COVID-19 academic graph dataset is derived from the COVID-19 Open Research Dataset (CORD-19) created by the Allen Institute for AI. From the 2020-03-27 release, which contains 33,000 full-text articles, we built a citation network and a semantic network.

COVID-19 Citation Network

After removing papers with missing title or author information, we obtain 28,757 papers (in JSON format) and construct two citation networks. The core network contains the author and citation information of these 28,757 papers only, while the complete network also includes the papers cited by these 28,757 papers and the authors of those cited papers. The details of both networks are shown below.

Dataset                   Core    Complete
Nodes (total)          181,166   2,223,994
  type:paper            28,757   1,132,472
  type:author          152,409   1,091,522
Edges (total)          233,556   6,431,265
  type:isWrittenBy     193,525   4,684,141
  type:isCitedBy        40,031   1,747,124

Each network consists of two files, a node file and an edge file. Both contain one triple per line, with fields separated by a tab character ("\t"). Each triple in the node file has the format (nodeID, node attribute name, attribute value). For paper nodes, we include the title and the name of the original JSON file, i.e. fileName. Similarly, for author nodes, we include firstname, lastname and affiliation. Each triple in the edge file has the format (nodeID, edge type, nodeID).
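The file layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' loader: the helper names (read_triples, load_nodes, load_edges), the node IDs "p1" and "a1", and the sample values are hypothetical, and the edge direction (paper, isWrittenBy, author) is an assumption. Only the tab-separated triple format and the attribute names come from the description.

```python
import io
from collections import defaultdict


def read_triples(lines):
    """Yield 3-tuples from tab-separated lines, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first two tabs only, so the third field is kept whole.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        yield tuple(parts)


def load_nodes(lines):
    """Group (nodeID, attribute name, attribute value) triples by node ID."""
    nodes = defaultdict(dict)
    for node_id, attr, value in read_triples(lines):
        nodes[node_id][attr] = value
    return dict(nodes)


def load_edges(lines):
    """Return (nodeID, edge type, nodeID) triples as a list of tuples."""
    return list(read_triples(lines))


# Hypothetical sample data; real files would be opened with open(path, encoding="utf-8").
sample_nodes = io.StringIO(
    "p1\ttitle\tAn example paper\n"
    "p1\tfileName\texample.json\n"
    "a1\tfirstname\tAda\n"
    "a1\tlastname\tLovelace\n"
    "a1\taffiliation\tExample University\n"
)
sample_edges = io.StringIO("p1\tisWrittenBy\ta1\n")  # direction is an assumption

nodes = load_nodes(sample_nodes)
edges = load_edges(sample_edges)

# Collect the author IDs attached to each paper.
authors_of = defaultdict(list)
for source, edge_type, target in edges:
    if edge_type == "isWrittenBy":
        authors_of[source].append(target)
```

Grouping node triples into one dictionary per node keeps each paper's or author's attributes together, which is usually more convenient than working with the raw triples. For the complete network, with several million edges, iterating over read_triples directly avoids holding every edge in memory.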

COVID-19 Semantic Network

By extracting the full text of each paper and applying OpenIE, we generate a semantic graph that contains COVID-related information. The graph, COVID-academic-semantic, contains 46,020,975 triples in total, covering 17,666,478 entities and 1,778,310 distinct relation types. For each triple, we record the file name of the paper from which it was extracted.
Additionally, we use Wikifier to identify entities and link them to Wiki pages. Of the extracted entities, 150,176 could be identified in this way.
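As an illustration of how the triple, entity and relation counts relate to each other, the sketch below computes the same three statistics for a handful of triples. The field order (subject, relation, object, paper file name), the function name summarize and the sample triples are all assumptions made for illustration; the actual file layout of the semantic network is not specified here.

```python
def summarize(rows):
    """Count triples, distinct entities and distinct relation types.

    Each row is assumed to be (subject, relation, object, paper file name).
    Subjects and objects are both counted as entities.
    """
    entities = set()
    relations = set()
    triple_count = 0
    for subject, relation, obj, _paper_file in rows:
        entities.update((subject, obj))
        relations.add(relation)
        triple_count += 1
    return triple_count, len(entities), len(relations)


# Hypothetical sample triples, not taken from the dataset.
sample = [
    ("coronavirus", "causes", "respiratory illness", "paper_a.json"),
    ("coronavirus", "binds to", "ACE2 receptor", "paper_b.json"),
    ("respiratory illness", "causes", "fever", "paper_b.json"),
]

stats = summarize(sample)
print(stats)  # (triples, entities, relation types)
```

Because OpenIE relations are free-text phrases rather than a fixed vocabulary, the number of distinct relation types grows with the corpus, which is consistent with the large relation count reported above.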

Download Links

Citation network - core []
Citation network - complete []
Semantic network []
Semantic network - Wiki entities []