Node Property Prediction

The task is to predict properties of single nodes.

Summary

- Datasets

Scale | Name | Package | #Nodes | #Edges* | #Tasks | Split Type | Task Type | Metric
------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------
Medium | ogbn-products | >=1.1.1 | 2,449,029 | 61,859,140 | 1 | Sales rank | Multi-class classification | Accuracy
Medium | ogbn-proteins | >=1.1.1 | 132,534 | 39,561,252 | 112 | Species | Binary classification | ROC-AUC
Small | ogbn-arxiv | >=1.1.1 | 169,343 | 1,166,243 | 1 | Time | Multi-class classification | Accuracy
Large | ogbn-papers100M | >=1.2.0 | 111,059,956 | 1,615,685,872 | 1 | Time | Multi-class classification | Accuracy
Medium | ogbn-mag | >=1.2.1 | 1,939,743 | 21,111,007 | 1 | Time | Multi-class classification | Accuracy

* Note: For undirected graphs, the loaded graph has twice the listed number of edges because both directions of each edge are added automatically.
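To make the edge-count note concrete, here is a minimal sketch on a toy edge list. The helper name `add_reverse_edges` is hypothetical and is not part of the ogb package; it only illustrates why the loaded edge count is doubled.

```python
def add_reverse_edges(edges):
    """Return the edge list with the reverse of every edge appended."""
    return edges + [(dst, src) for (src, dst) in edges]


# Toy undirected graph with 3 edges, each stored once.
undirected_edges = [(0, 1), (1, 2), (2, 3)]
bidirectional_edges = add_reverse_edges(undirected_edges)

# The bidirectional list holds twice as many entries as the original.
print(len(undirected_edges), len(bidirectional_edges))
```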

- Module

We prepare three data loader variants: (1) a PyTorch Geometric one, (2) a DGL one, and (3) a library-agnostic one. We also prepare a unified performance evaluator.


Dataset ogbn-products (Leaderboard):

Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold on Amazon, and an edge between two products indicates that the products are purchased together. We follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions, followed by a Principal Component Analysis to reduce the dimension to 100.

Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

Dataset splitting: We consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), we use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, we sort the products according to their sales ranking and use the top 10% for training, the next top 2% for validation, and the rest for testing. This splitting procedure closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
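The ranking-based split can be sketched as follows. This is an illustration on toy data, not the code used to build the dataset: the function name `split_by_rank`, the toy ranks, and the convention that rank 1 is the best seller are all assumptions, and the fractions are taken from the paragraph above.

```python
def split_by_rank(sales_rank, train_frac=0.10, valid_frac=0.02):
    """Split node ids into train/valid/test by ascending sales rank.

    sales_rank[i] is the rank of node i (assumed: 1 = best seller).
    """
    order = sorted(range(len(sales_rank)), key=lambda i: sales_rank[i])
    n_train = int(len(order) * train_frac)
    n_valid = int(len(order) * valid_frac)
    return {
        "train": order[:n_train],
        "valid": order[n_train:n_train + n_valid],
        "test": order[n_train + n_valid:],
    }


# 50 toy products; node i has rank 50 - i, so node 49 is the best seller.
toy_rank = [50 - i for i in range(50)]
split = split_by_rank(toy_rank)
print({name: len(ids) for name, ids in split.items()})
```

The most popular products land in the training set and the long tail of less popular products forms the test set, which is the behaviour the paragraph describes.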

Note: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.

References

[1] http://manikvarma.org/downloads/XC/XMLRepository.html
[2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 257–266, 2019.

License: Amazon license


Dataset ogbn-proteins (Leaderboard):

Graph: The ogbn-proteins dataset is an undirected, weighted, and typed (according to species) graph. Nodes represent proteins, and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression or homology [1,2]. All edges come with 8-dimensional features, where each dimension represents the strength of a single association type and takes values between 0 and 1 (the larger the value is, the stronger the association is). The proteins come from 8 species.

Prediction task: The task is to predict the presence of protein functions in a multi-label binary classification setup, where there are 112 kinds of labels to predict in total. The performance is measured by the average of ROC-AUC scores across the 112 tasks.
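To illustrate what "average of ROC-AUC scores across tasks" means, the sketch below computes ROC-AUC per label column and averages the results. It is a dependency-free approximation for small inputs, not the ogb evaluator; the function names are hypothetical, and raising an error for a task that lacks positives or negatives is a simplifying assumption.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_roc_auc(y_true, y_score):
    """Average ROC-AUC over the task columns of (num_nodes, num_tasks) nested lists."""
    num_tasks = len(y_true[0])
    scores = [
        roc_auc([row[k] for row in y_true], [row[k] for row in y_score])
        for k in range(num_tasks)
    ]
    return sum(scores) / num_tasks


# 4 nodes, 2 binary tasks (the dataset has 112 tasks).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.4], [0.3, 0.7]]
average_auc = mean_roc_auc(y_true, y_score)
print(average_auc)
```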

Dataset splitting: We split the protein nodes into training/validation/test sets according to the species which the proteins come from. This enables the evaluation of the generalization performance of the model across different species.

References

[1] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1):D607–D613, 2019.
[2] Gene Ontology Consortium. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330–D338, 2018.

License: CC-0


Dataset ogbn-arxiv (Leaderboard):

Graph: The ogbn-arxiv dataset is a directed graph, representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper and each directed edge indicates that one paper cites another one. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. In addition, all papers are also associated with the year that the corresponding paper was published.
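The feature construction described above (averaging word embeddings over the title and abstract) can be sketched on toy data. The embedding table, the whitespace tokenization, and the choice to skip words without an embedding are assumptions made for illustration; the actual preprocessing may differ.

```python
def average_embedding(tokens, word_vectors):
    """Average the vectors of the tokens that have an entry in word_vectors."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical 3-dimensional embeddings (the dataset uses 128 dimensions).
toy_vectors = {
    "graph": [1.0, 0.0, 2.0],
    "neural": [0.0, 1.0, 2.0],
    "network": [2.0, 2.0, 2.0],
}
title_and_abstract = "graph neural network for unknownword".split()
paper_feature = average_embedding(title_and_abstract, toy_vectors)
print(paper_feature)
```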

Prediction task: The task is to predict the 40 subject areas of arXiv CS papers, e.g., cs.AI, cs.LG, and cs.OS, which are manually determined (i.e., labeled) by the paper’s authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication’s areas and topics. Formally, the task is to predict the primary categories of the arXiv papers, which is formulated as a 40-class classification problem.

Dataset splitting: We consider a realistic data split based on the publication dates of the papers. The general setting is that the ML models are trained on existing papers and then used to predict the subject areas of newly published papers, which supports their direct application to real-world scenarios, such as helping the arXiv moderators. Specifically, we propose to train on papers published until 2017, validate on those published in 2018, and test on those published since 2019.
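A minimal sketch of this time-based split on toy publication years follows. The function name `split_by_year` and the toy data are hypothetical; the year boundaries are the ones stated in the paragraph above.

```python
def split_by_year(years, last_train_year=2017, valid_year=2018):
    """Assign node ids to train/valid/test from their publication years."""
    split = {"train": [], "valid": [], "test": []}
    for node_id, year in enumerate(years):
        if year <= last_train_year:
            split["train"].append(node_id)
        elif year == valid_year:
            split["valid"].append(node_id)
        else:
            split["test"].append(node_id)
    return split


# years[i] is the publication year of paper i.
toy_years = [2015, 2019, 2018, 2017, 2020, 2018]
split = split_by_year(toy_years)
print(split)
```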

References

[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013.

License: ODC-BY


Dataset ogbn-papers100M (Leaderboard):

Graph: The ogbn-papers100M dataset is a directed citation graph of 111 million papers indexed by MAG [1]. Its graph structure and node features are constructed in the same way as ogbn-arxiv. Among its nodes, approximately 1.5 million are arXiv papers, each of which is manually labeled with one of arXiv’s subject areas. Overall, this dataset is orders of magnitude larger than any existing node classification dataset.

Prediction task: Given the full ogbn-papers100M graph, the task is to predict the subject areas of the subset of papers that are published in arXiv. The majority of nodes (corresponding to non-arXiv papers) are not associated with label information, and only their node features and reference information are given. The task is to leverage the entire citation network to infer the labels of the arXiv papers. In total, there are 172 arXiv subject areas, making the prediction task a 172-class classification problem.

Dataset splitting: The splitting strategy is the same as that used in ogbn-arxiv, i.e., the time-based split. Specifically, the training nodes (with labels) are all arXiv papers published until 2017, while the validation nodes are the arXiv papers published in 2018, and the models are tested on arXiv papers published since 2019.

References

[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.

License: ODC-BY


Dataset ogbn-mag (Leaderboard):

Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.

Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.

Dataset splitting: We follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph, i.e., we train models to predict venue labels on all papers published before 2018, validate them on papers published in 2018, and test them on papers published since 2019.

References

[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.

License: ODC-BY


Data Loader

To load a dataset, replace d_name with the dataset name (e.g., "ogbn-proteins").

PyTorch Geometric Loader

from ogb.nodeproppred import PygNodePropPredDataset

d_name = "ogbn-proteins"  # replace with the name of the dataset to load
dataset = PygNodePropPredDataset(name=d_name)

# Node indices of the predefined training/validation/test split
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph = dataset[0]  # PyG graph object

DGL Loader

from ogb.nodeproppred import DglNodePropPredDataset

d_name = "ogbn-proteins"  # replace with the name of the dataset to load
dataset = DglNodePropPredDataset(name=d_name)

# Node indices of the predefined training/validation/test split
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph, label = dataset[0]  # graph: DGL graph object, label: torch tensor of shape (num_nodes, num_tasks)

{train,valid,test}_idx are one-dimensional torch tensors holding the indices of the nodes assigned to the training/validation/test sets, so each has one entry per node in its split. In the PyTorch Geometric dataset, the prediction target can be accessed via graph.y, which is a torch tensor of shape (num_nodes, num_tasks) whose i-th row holds the target labels of the i-th node.

Library-Agnostic Loader

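from ogb.nodeproppred import NodePropPredDataset

d_name = "ogbn-proteins"  # replace with the name of the dataset to load
dataset = NodePropPredDataset(name=d_name)

# Node indices of the predefined training/validation/test split
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph, label = dataset[0]  # graph: library-agnostic graph object (a dictionary)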

The library-agnostic graph object is a dictionary containing the following keys: edge_index, edge_feat, node_feat, and num_nodes, which are detailed below.

  • edge_index: numpy array of shape (2, num_edges), where each column represents an edge. The first and second rows hold the indices of the source and target nodes, respectively. Undirected edges are represented by bi-directional edges.
  • edge_feat: numpy array of shape (num_edges, edgefeat_dim), where edgefeat_dim is the dimensionality of edge features and the i-th row holds the feature of the i-th edge. This can be None if no input edge features are available.
  • node_feat: numpy array of shape (num_nodes, nodefeat_dim), where nodefeat_dim is the dimensionality of node features and the i-th row holds the feature of the i-th node. This can be None if no input node features are available.
  • num_nodes: number of nodes in the graph.
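The layout of these keys can be illustrated with a hand-built toy graph. In this sketch plain nested lists stand in for the numpy arrays so that the example has no dependencies; a real graph object returned by the loader holds numpy arrays instead.

```python
# Toy undirected path graph 0 - 1 - 2, stored with both edge directions.
graph = {
    "edge_index": [
        [0, 1, 1, 2],  # source node of each edge
        [1, 0, 2, 1],  # target node of each edge
    ],
    "edge_feat": None,  # no input edge features in this toy graph
    "node_feat": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # shape (3, 2)
    "num_nodes": 3,
}

sources, targets = graph["edge_index"]
num_edges = len(sources)

# Count incoming edges per node by walking over the columns of edge_index.
in_degree = [0] * graph["num_nodes"]
for src, dst in zip(sources, targets):
    in_degree[dst] += 1

print(num_edges, in_degree)
```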

Heterogeneous graph: We represent a heterogeneous graph using dictionaries: edge_index_dict, edge_feat_dict, node_feat_dict, and num_nodes_dict.

  • edge_index_dict: A dictionary mapping each triplet (head type, relation type, tail type) into corresponding edge_index.
  • edge_feat_dict: A dictionary mapping each triplet (head type, relation type, tail type) into corresponding edge_feat.
  • node_feat_dict: A dictionary mapping each node type into corresponding node_feat.
  • num_nodes_dict: A dictionary mapping each node type into corresponding num_nodes.
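A toy heterogeneous graph in this dictionary layout might look as follows. The node type and relation names are illustrative (the exact key strings used by a given dataset may differ), nested lists again stand in for numpy arrays, and `check_edge_indices` is a hypothetical helper.

```python
# Toy heterogeneous graph with two node types and two relation types.
num_nodes_dict = {"author": 2, "paper": 3}
edge_index_dict = {
    # (head type, relation type, tail type) -> [head indices, tail indices]
    ("author", "writes", "paper"): [[0, 0, 1], [0, 1, 2]],
    ("paper", "cites", "paper"): [[1, 2], [0, 0]],
}
node_feat_dict = {"paper": [[0.1], [0.2], [0.3]]}  # only papers have features


def check_edge_indices(edge_index_dict, num_nodes_dict):
    """Verify that every edge endpoint is a valid index for its node type."""
    for (head, _relation, tail), (heads, tails) in edge_index_dict.items():
        if not all(0 <= h < num_nodes_dict[head] for h in heads):
            return False
        if not all(0 <= t < num_nodes_dict[tail] for t in tails):
            return False
    return True


# Indices are local to each node type: author 0 and paper 0 are different nodes.
papers_per_author = [0] * num_nodes_dict["author"]
for author_id in edge_index_dict[("author", "writes", "paper")][0]:
    papers_per_author[author_id] += 1

print(check_edge_indices(edge_index_dict, num_nodes_dict), papers_per_author)
```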

Note: Some graph datasets may contain additional meta-information on nodes or edges, such as their time stamps. Although this information is not given as default input features, researchers should feel free to exploit it.


Performance Evaluator

Evaluators are customized for each dataset. We require users to pass inputs in a pre-specified format to the evaluator. First, inspect the evaluator's expected input and output formats as follows.

from ogb.nodeproppred import Evaluator

evaluator = Evaluator(name = d_name)
print(evaluator.expected_input_format) 
print(evaluator.expected_output_format) 

Then, you can pass the input dictionary (denoted by input_dict below) of the specified format, and get the performance of your prediction.

# In most cases, input_dict is
# input_dict = {"y_true": y_true, "y_pred": y_pred}
result_dict = evaluator.eval(input_dict)
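For the accuracy-based datasets, the shape of such an input_dict can be illustrated without the package. The sketch below is a hypothetical stand-in, not the ogb Evaluator: `toy_accuracy` and the "acc" result key are assumptions, and nested lists replace the arrays or tensors that the real evaluator expects (check expected_input_format for the authoritative specification).

```python
def toy_accuracy(input_dict):
    """Fraction of (node, task) entries where the prediction equals the label."""
    y_true, y_pred = input_dict["y_true"], input_dict["y_pred"]
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must cover the same nodes")
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            correct += int(t == p)
            total += 1
    return {"acc": correct / total}


# 4 nodes, 1 task: both entries have shape (num_nodes, num_tasks) = (4, 1).
input_dict = {
    "y_true": [[2], [0], [1], [2]],
    "y_pred": [[2], [1], [1], [2]],
}
result_dict = toy_accuracy(input_dict)
print(result_dict)
```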