Subscribe to our Google group to stay updated.
Learn about the datasets
- Please learn about the datasets and their usage.
Installing Python Package
First, please install our ogb Python package, as all of our datasets are downloaded and prepared using the package.
The model evaluation and test submission file preparation are also handled by our package.
Please install/update it by:
pip install -U ogb
Summary of Datasets
OGB-LSC provides three large-scale datasets. The dataset statistics and basic information are summarized below. Each dataset is described in detail on its own dataset page (follow the links).
Task category | Name | Package version | #Graphs | #Total nodes | #Total edges | Task Type | Metric | Download size |
---|---|---|---|---|---|---|---|---|
Node-level | MAG240M | >=1.3.2 | 1 | 244,160,499 | 1,728,364,232 | Multi-class classification | Accuracy | 167GB |
Link-level | WikiKG90Mv2 | >=1.3.3 | 1 | 91,230,610 | 601,062,811 | KG completion | MRR | 89GB |
Graph-level | PCQM4Mv2 | >=1.3.2 | 3,746,619 | 52,970,652 | 54,546,813 | Regression | MAE | 59MB‡ |
‡: The PCQM4Mv2 dataset is provided as SMILES strings. After processing them into graph objects, the final file size will be around 8GB.
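For readers unfamiliar with the three metrics in the table, the following is a minimal plain-Python sketch of their standard textbook definitions. The function names are hypothetical and chosen only for illustration; this is not the package's evaluation code, and official results should always be computed with the evaluators shipped in the ogb package.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank, where each rank is the 1-based position
    of the correct answer among the candidates for one query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Higher is better for accuracy and MRR, while lower is better for MAE.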
Important: Make sure the command below prints the required package version for the dataset you are working on.
python -c "import ogb; print(ogb.__version__)"
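The same check can be scripted. The sketch below compares the installed version against the minimum versions listed in the table above. The helper names (`REQUIRED_OGB_VERSION`, `parse_version`, `meets_requirement`, `report`) are hypothetical, and the parsing assumes plain dotted numeric versions such as `1.3.3`; it would need adjusting for pre-release or development version strings. It reads the installed version through the standard library's `importlib.metadata` instead of importing ogb.

```python
from importlib import metadata

# Minimum ogb versions, copied from the dataset summary table.
REQUIRED_OGB_VERSION = {
    "MAG240M": "1.3.2",
    "WikiKG90Mv2": "1.3.3",
    "PCQM4Mv2": "1.3.2",
}


def parse_version(version):
    """Turn a plain dotted version such as '1.3.3' into a tuple of ints.

    Assumption: no pre-release or local suffixes (e.g. '1.3.3.dev0').
    """
    return tuple(int(part) for part in version.split("."))


def meets_requirement(installed, dataset):
    """Return True if the installed version is new enough for the dataset."""
    required = REQUIRED_OGB_VERSION[dataset]
    return parse_version(installed) >= parse_version(required)


def report():
    """Print, per dataset, whether the installed ogb version suffices."""
    try:
        installed = metadata.version("ogb")
    except metadata.PackageNotFoundError:
        print("ogb is not installed; run: pip install -U ogb")
        return
    for dataset, required in REQUIRED_OGB_VERSION.items():
        status = "ok" if meets_requirement(installed, dataset) else "too old"
        print(f"{dataset}: needs >={required}, installed {installed} ({status})")


if __name__ == "__main__":
    report()
```

Comparing integer tuples rather than raw strings matters here, since a string comparison would rank `1.3.10` below `1.3.2`.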
Baselines
In our paper, we further perform an extensive baseline analysis on each dataset, implementing simple baseline models as well as advanced expressive models at scale.
We find that advanced expressive models, despite requiring more effort to scale up, do benefit from large data and significantly outperform simple baseline models that are easy to scale.
All of our baseline code is made publicly available to facilitate public research.
Please also check out the public leaderboards (evaluated on the test-dev set) for state-of-the-art submissions.