This COVID-19 academic graph dataset is built from the COVID-19 Open Research Dataset created by the Allen Institute for AI. Using the 2020-03-27 release, which contains 33,000 full-text articles, we created a citation network and a semantic network.
After filtering out papers with missing title or author information, we obtain 28,757 papers (in JSON format) and construct two citation networks. The core network contains the author and citation information of these 28,757 papers only, while the complete network also contains the papers cited by these 28,757 papers and their authors. The details of these two networks are shown below.
| | Core | Complete |
|---|---|---|
| # of nodes | 181,166 | 2,223,994 |
| # of edges | 233,556 | 6,431,265 |
Each network consists of two files, a node file and an edge file. Both contain triples whose fields are separated by a tab character ("\t"). Each triple in the node file has the format (nodeID, node attribute name, attribute value). For paper nodes, we include the title and the name of the original JSON file, i.e. fileName. Similarly, for author nodes, we include firstname, lastname and affiliation. Each triple in the edge file has the format (nodeID, edge type, nodeID).
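The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the loader used to build the dataset. The helper names (`parse_triples`, `load_nodes`, `load_edges`), the sample IDs and the edge type labels `author_of` and `cites` are made up for illustration; the actual IDs and edge types are whatever the downloaded files contain.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def parse_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield 3-field records from tab-separated lines; skip anything else."""
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) == 3:
            yield parts[0], parts[1], parts[2]


def load_nodes(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group (nodeID, attribute name, attribute value) triples by node ID."""
    nodes: Dict[str, Dict[str, str]] = defaultdict(dict)
    for node_id, attr_name, attr_value in parse_triples(lines):
        nodes[node_id][attr_name] = attr_value
    return dict(nodes)


def load_edges(lines: Iterable[str]) -> List[Triple]:
    """Return (nodeID, edge type, nodeID) triples as a list."""
    return list(parse_triples(lines))


# Made-up sample rows (hypothetical IDs, values and edge type labels).
sample_node_lines = [
    "p1\ttitle\tAn example paper\n",
    "p1\tfileName\texample.json\n",
    "a1\tfirstname\tJane\n",
    "a1\tlastname\tDoe\n",
    "a1\taffiliation\tExample University\n",
]
sample_edge_lines = [
    "a1\tauthor_of\tp1\n",
    "p1\tcites\tp2\n",
]

nodes = load_nodes(sample_node_lines)
edges = load_edges(sample_edge_lines)
print(nodes["p1"]["title"])
print(len(edges))
```

To read the real data, pass an open file handle instead of a list, for example `with open(path, encoding="utf-8") as f: nodes = load_nodes(f)`, where `path` points at the node file extracted from the archive. The UTF-8 encoding is an assumption here.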
By extracting the full text of each paper and applying OpenIE, we generate a semantic graph that contains COVID-related information. The graph, COVID-academic-semantic, contains 46,020,975 triples in total, covering 17,666,478 entities and 1,778,310 relation types. Each triple is associated with the file name of the paper it was extracted from.
Additionally, we use Wikifier to identify the entities and link them to Wikipedia pages. In total, 150,176 entities could be identified this way.
Citation network - core [COVID-academic-core.zip]
Citation network - complete [COVID-academic-complete.zip]
Semantic network [COVID-academic-semantic.zip]
Semantic network - Wiki entities [COVID-academic-semantic-entity-wiki.zip]