Learn about the datasets

Please learn about the datasets and their usage.

Installing Python Package

First, please install our ogb Python package. All of our datasets are downloaded and prepared using the package, and it also handles model evaluation and the preparation of test submission files. Install or update it with:

pip install -U ogb

Summary of Datasets

OGB-LSC provides three large-scale datasets. Their statistics and basic information are summarized below. Each dataset is described in detail on its own dataset page.

Task category	Name	Required ogb version	#Graphs	#Total nodes	#Total edges	Task type	Metric	Download size
Node-level	MAG240M	>=1.3.2	1	244,160,499	1,728,364,232	Multi-class classification	Accuracy	167GB
Link-level	WikiKG90Mv2	>=1.3.3	1	91,230,610	601,062,811	KG completion	MRR	89GB
Graph-level	PCQM4Mv2	>=1.3.2	3,746,619	52,970,652	54,546,813	Regression	MAE	59MB‡

‡: The PCQM4Mv2 dataset is provided as SMILES strings. After they are processed into graph objects, the resulting file size will be around 8GB.

Important: Make sure the command below prints a version that satisfies the package requirement for the dataset you are working on.

python -c "import ogb; print(ogb.__version__)"
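The same check can also be done programmatically. The following is a minimal sketch, not part of the ogb package: the helper names (parse_version, meets_requirement, check_installed) are hypothetical, the minimum versions are copied from the summary table above, and the comparison assumes plain numeric version strings such as "1.3.3".

```python
from importlib import metadata

# Minimum ogb versions, copied from the summary table above.
REQUIRED_OGB_VERSION = {
    "MAG240M": "1.3.2",
    "WikiKG90Mv2": "1.3.3",
    "PCQM4Mv2": "1.3.2",
}


def parse_version(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison.

    Assumption: the version has no suffix such as 'rc1' or '.dev0'.
    """
    return tuple(int(part) for part in version.split("."))


def meets_requirement(installed, dataset):
    """Return True if `installed` is at least the version the dataset needs."""
    required = REQUIRED_OGB_VERSION[dataset]
    return parse_version(installed) >= parse_version(required)


def check_installed(dataset):
    """Compare the installed ogb version with the requirement.

    Returns None when ogb is not installed at all.
    """
    try:
        installed = metadata.version("ogb")
    except metadata.PackageNotFoundError:
        return None
    return meets_requirement(installed, dataset)


if __name__ == "__main__":
    for name in REQUIRED_OGB_VERSION:
        print(name, check_installed(name))
```

Comparing integer tuples rather than raw strings matters once a version component reaches two digits: as strings, "1.3.10" would sort before "1.3.3".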

Baselines

In our paper, we further perform an extensive baseline analysis on each dataset, implementing simple baseline models as well as advanced expressive models at scale. We find that advanced expressive models, despite requiring more effort to scale up, do benefit from large data and significantly outperform simple baseline models that are easy to scale. All of our baseline code is publicly available to facilitate research. Please also check out the public leaderboards (evaluated on the test-dev set) for state-of-the-art submissions.