Node Property Prediction
The task is to predict properties of single nodes.
Summary
- Datasets
| Scale | Name | Package | #Nodes | #Edges* | #Tasks | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|---|---|
| Medium | ogbn-products | >=1.1.1 | 2,449,029 | 61,859,140 | 1 | Sales rank | Multi-class classification | Accuracy |
| Medium | ogbn-proteins | >=1.1.1 | 132,534 | 39,561,252 | 112 | Species | Binary classification | ROC-AUC |
| Small | ogbn-arxiv | >=1.1.1 | 169,343 | 1,166,243 | 1 | Time | Multi-class classification | Accuracy |
| Large | ogbn-papers100M | >=1.2.0 | 111,059,956 | 1,615,685,872 | 1 | Time | Multi-class classification | Accuracy |
| Medium | ogbn-mag | >=1.2.1 | 1,939,743 | 21,111,007 | 1 | Time | Multi-class classification | Accuracy |
*Note: For undirected graphs, the loaded graphs have double the number of edges because we add the bidirectional edges automatically.
- Module
We prepare three data loader variants: (1) a PyTorch Geometric loader, (2) a DGL loader, and (3) a library-agnostic loader. We also provide a unified performance evaluator.
Dataset ogbn-products (Leaderboard):
Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold on Amazon, and an edge between two products indicates that they are purchased together. We follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions, followed by Principal Component Analysis to reduce the dimensionality to 100.
Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used as the target labels.
Dataset splitting: We consider a more challenging and realistic dataset splitting than the one used in [2]. Instead of randomly assigning 90% of the nodes to training and 10% to testing (with no validation set), we use the sales ranking (popularity) to split the nodes into training/validation/test sets. Specifically, we sort the products according to their sales ranking, and use the top 8% for training, the next 2% for validation, and the rest for testing. This more challenging splitting procedure closely matches the real-world application where labels are first assigned to the important nodes in a network and ML models are subsequently used to make predictions on the less important ones.
Note: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.
References
[1] http://manikvarma.org/downloads/XC/XMLRepository.html
[2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 257–266, 2019.
License: Amazon license
Dataset ogbn-proteins (Leaderboard):
Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression, or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the approximate confidence of a single association type and takes values between 0 and 1 (the larger the value, the more confident we are about the association). The proteins come from 8 species.
Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
Dataset splitting: We split the protein nodes into training/validation/test sets according to the species from which the proteins come. This enables the evaluation of a model's generalization performance across different species.
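Since ogbn-proteins provides edge features but no input node features, a common baseline preprocessing step is to derive initial node features from the incident edge features. Below is a minimal sketch of this idea using the PyTorch Geometric loader; it is an illustration of one possible choice, not part of the OGB API:

```python
# Sketch: derive 8-dimensional node features for ogbn-proteins by averaging
# the features of each node's incident edges (an illustrative baseline,
# not part of the OGB API).
import torch
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name="ogbn-proteins")
graph = dataset[0]

num_nodes = graph.num_nodes
x = torch.zeros(num_nodes, graph.edge_attr.size(1))
deg = torch.zeros(num_nodes, 1)
src = graph.edge_index[0]  # source node of each (bi-directional) edge

x.index_add_(0, src, graph.edge_attr)               # sum incident edge features
deg.index_add_(0, src, torch.ones(src.size(0), 1))  # count incident edges
graph.x = x / deg.clamp(min=1)                      # mean-pool into node features
```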
References
[1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta- Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
[2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.
License: CC-0
Dataset ogbn-arxiv (Leaderboard):
Graph: The ogbn-arxiv dataset is a directed graph, representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. We also provide the mapping from MAG paper IDs to the raw texts of titles and abstracts here. In addition, each paper is associated with its year of publication.
Prediction task: The task is to predict the subject area of each arXiv CS paper, e.g., cs.AI, cs.LG, and cs.OS, which is manually determined (i.e., labeled) by the paper's authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication's areas and topics. Formally, the task is to predict the primary categories of the arXiv papers, formulated as a 40-class classification problem.
Dataset splitting: We consider a realistic data split based on the publication dates of the papers. The general setting is that ML models are trained on existing papers and then used to predict the subject areas of newly published papers, which supports their direct application to real-world scenarios, such as assisting arXiv moderators. Specifically, we propose to train on papers published until 2017, validate on those published in 2018, and test on those published since 2019.
References
[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013.
License: ODC-BY
Dataset ogbn-papers100M (Leaderboard):
Graph: The ogbn-papers100M dataset is a directed citation graph of 111 million papers indexed by MAG [1]. Its graph structure and node features are constructed in the same way as those of ogbn-arxiv. Among its nodes, approximately 1.5 million are arXiv papers, each of which is manually labeled with one of arXiv's subject areas. Overall, this dataset is orders of magnitude larger than any existing node classification dataset.
Prediction task: Given the full ogbn-papers100M graph, the task is to predict the subject areas of the subset of papers that are published on arXiv. The majority of nodes (corresponding to non-arXiv papers) are not associated with label information; only their node features and reference information are given. The task is to leverage the entire citation network to infer the labels of the arXiv papers. In total, there are 172 arXiv subject areas, making the prediction task a 172-class classification problem.
Dataset splitting: The splitting strategy is the same as that of ogbn-arxiv, i.e., a time-based split. Specifically, the training nodes (with labels) are all arXiv papers published until 2017, the validation nodes are the arXiv papers published in 2018, and the models are tested on arXiv papers published since 2019.
References
[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
License: ODC-BY
Dataset ogbn-mag (Leaderboard):
Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two entity types: an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. As in ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector; none of the other entity types comes with input node features.
Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors' affiliations. This is of practical interest because venue information is unknown or missing for some manuscripts in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.
Dataset splitting: We follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph: models are trained to predict the venue labels of all papers published before 2018, then validated and tested on papers published in 2018 and since 2019, respectively.
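Because ogbn-mag is heterogeneous, its split indices are keyed by node type, and only paper nodes carry labels. A minimal sketch with the PyTorch Geometric loader, assuming the dictionary-style attribute names used in OGB's MAG examples (e.g., num_nodes_dict, y_dict):

```python
# Sketch: loading the heterogeneous ogbn-mag dataset. Split indices are
# dictionaries keyed by node type; "paper" is the only labeled node type.
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name="ogbn-mag")
split_idx = dataset.get_idx_split()
train_paper_idx = split_idx["train"]["paper"]  # indices of training paper nodes

graph = dataset[0]
print(graph.num_nodes_dict)         # node counts per entity type
print(graph.y_dict["paper"].shape)  # venue labels of paper nodes: (num_papers, 1)
```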
References
[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
License: ODC-BY
Data Loader
To load a dataset, replace d_name with the dataset name (e.g., "ogbn-proteins").
PyTorch Geometric Loader
```python
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name = d_name)

split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph = dataset[0]  # PyG graph object
```
DGL Loader
```python
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset(name = d_name)

split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph, label = dataset[0]  # graph: DGL graph object, label: torch tensor of shape (num_nodes, num_tasks)
```
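For example, a minimal usage sketch of the DGL loader's outputs (assuming a dataset that provides input node features, e.g., ogbn-arxiv):

```python
# Sketch: the DGL graph stores input node features in its ndata mapping
# (key "feat" for datasets with node features, e.g., ogbn-arxiv).
feat = graph.ndata["feat"]   # node features, shape (num_nodes, nodefeat_dim)
y_train = label[train_idx]   # labels of the training nodes
```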
{train,valid,test}_idx are 1-dimensional torch tensors containing the node indices assigned to the training/validation/test sets.
The prediction target of the PyTorch Geometric dataset can be accessed via graph.y, a torch tensor of shape (num_nodes, num_tasks), where the i-th row represents the target labels of the i-th node.
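For instance, a short sketch of selecting the training labels (and features, where the dataset provides them) with the split indices above:

```python
# Sketch: index the PyG graph's labels and features with the split indices.
y_train = graph.y[train_idx]  # shape (len(train_idx), num_tasks)
x_train = graph.x[train_idx]  # only if the dataset provides node features
```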
Library-Agnostic Loader
```python
from ogb.nodeproppred import NodePropPredDataset

dataset = NodePropPredDataset(name = d_name)

split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph, label = dataset[0]  # graph: library-agnostic graph object (a dict), label: numpy ndarray of shape (num_nodes, num_tasks)
```
The library-agnostic graph object is a dictionary containing the following keys: edge_index, edge_feat, node_feat, and num_nodes, which are detailed below (a short usage sketch follows the list).
- edge_index: numpy ndarray of shape (2, num_edges), where each column represents an edge. The first row and the second row represent the indices of the source and target nodes, respectively. Undirected edges are represented by bi-directional edges.
- edge_feat: numpy ndarray of shape (num_edges, edgefeat_dim), where edgefeat_dim is the dimensionality of edge features and the i-th row represents the feature of the i-th edge. This can be None if no input edge features are available.
- node_feat: numpy ndarray of shape (num_nodes, nodefeat_dim), where nodefeat_dim is the dimensionality of node features and the i-th row represents the feature of the i-th node. This can be None if no input node features are available.
- num_nodes: the number of nodes in the graph.
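As an illustration of how these fields fit together, here is a sketch that assembles a sparse adjacency matrix from the dictionary (scipy is an assumed extra dependency here; this helper is illustrative, not part of OGB):

```python
# Sketch: build a scipy sparse adjacency matrix from the library-agnostic
# graph dictionary returned by NodePropPredDataset.
import numpy as np
import scipy.sparse as sp

edge_index = graph["edge_index"]  # shape (2, num_edges)
num_nodes = graph["num_nodes"]
adj = sp.coo_matrix(
    (np.ones(edge_index.shape[1]), (edge_index[0], edge_index[1])),
    shape=(num_nodes, num_nodes),
)
```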
Heterogeneous graph: We represent a heterogeneous graph using dictionaries: edge_index_dict, edge_feat_dict, node_feat_dict, and num_nodes_dict, detailed below (a short inspection sketch follows the list).
- edge_index_dict: A dictionary mapping each triplet (head type, relation type, tail type) to the corresponding edge_index.
- edge_feat_dict: A dictionary mapping each triplet (head type, relation type, tail type) to the corresponding edge_feat.
- node_feat_dict: A dictionary mapping each node type to the corresponding node_feat.
- num_nodes_dict: A dictionary mapping each node type to the corresponding num_nodes.
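A quick sketch of inspecting these dictionaries for a heterogeneous dataset such as ogbn-mag loaded with the library-agnostic loader:

```python
# Sketch: enumerate the relation types and node types of a heterogeneous
# graph (e.g., ogbn-mag) loaded with NodePropPredDataset.
for (head, rel, tail), edge_index in graph["edge_index_dict"].items():
    print(f"({head}) -[{rel}]-> ({tail}): {edge_index.shape[1]} edges")
for node_type, n in graph["num_nodes_dict"].items():
    print(f"{node_type}: {n} nodes")
```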
Note: Some graph datasets may contain additional meta-information on nodes or edges, such as time stamps. Although this is not given as part of the default input features, researchers should feel free to exploit this additional information.
Performance Evaluator
Evaluators are customized for each dataset. We require users to pass input in a pre-specified format to the evaluator. First, please learn the input and output format specifications of the evaluator as follows.
```python
from ogb.nodeproppred import Evaluator

evaluator = Evaluator(name = d_name)
print(evaluator.expected_input_format)
print(evaluator.expected_output_format)
```
Then, you can pass an input dictionary (denoted by input_dict below) in the specified format and get the performance of your prediction.
```python
# In most cases, input_dict is
# input_dict = {"y_true": y_true, "y_pred": y_pred}
result_dict = evaluator.eval(input_dict)
```
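As a concrete sketch for an accuracy-based dataset such as ogbn-arxiv, reusing the graph and test_idx from the PyTorch Geometric loader above (logits stands for a hypothetical model output of shape (num_nodes, num_classes)):

```python
# Sketch: evaluate test-set predictions on ogbn-arxiv. "logits" is a
# hypothetical model output; for accuracy-based datasets the evaluator
# expects class indices of shape (num_test_nodes, 1).
from ogb.nodeproppred import Evaluator

evaluator = Evaluator(name="ogbn-arxiv")
y_pred = logits[test_idx].argmax(dim=-1, keepdim=True)
input_dict = {"y_true": graph.y[test_idx], "y_pred": y_pred}
result_dict = evaluator.eval(input_dict)  # e.g., {"acc": ...}
```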