Rules for OGB-LSC Leaderboards
Please read carefully what is and is not allowed in OGB-LSC leaderboard submissions.
We allow you to use the training and validation sets in any way you like. For example, there is no need to reserve the validation set for model selection only; you can directly train your model on the validation set if you find it useful. Nevertheless, it is often helpful for the community to know the validation performance on the standardized validation split. Therefore, we still provide the official validation set and require every leaderboard submission to report validation performance on it.
For the test data, you should only use it for model inference (making predictions and saving them for your test submissions). In other words, your model should be developed only on the training and validation sets and should not touch the test data except for the final inference. The only exception is MAG240M, where you can use the test nodes in any way, since the dataset is modeled as a transductive prediction task (i.e., test nodes are part of the entire graph).
Code and Technical Report Submissions
Each leaderboard submission must include the following:
- All the code needed to reproduce your results (including data pre-processing and model training/inference) and to save the test submission.
- README.md that contains all the instructions to run the code (from data pre-processing to model inference on test data).
In addition, we require a short technical report that describes your approach. The report can be linked either as an arXiv preprint or as a PDF uploaded to your Github repository. You are free to update the report once the test-dev performance is announced.
Use of External Data: Not allowed
We do not allow the use of any external datasets to train models.
Use of Text data: Not allowed
Based on requests from the community, we have released text data for MAG240M (Download (33GB)) and WikiKG90Mv2 (Download (2.4GB)). The text data can be used for various purposes, such as analyzing a model's predictions, improving model performance, and pre-training models. However, for the purpose of this leaderboard, we do not allow models to use the text data. This is because LLMs may leak our test set information, having been trained on our test data (which is publicly available). In the future, we plan to set up a new specialized leaderboard where the use of LLMs and text data is allowed. We welcome suggestions from the community.
Moreover, since the text data is now released, it has become easier to reveal hidden test labels by accessing the public database. We ask the community not to do so. Keep in mind that you will need to share all the code to reproduce your solution through a public Github repository, so any obvious misconduct (e.g., training or doing early stopping on test labels, or directly using the test labels as predictions) will be revealed.
Test Inference Time for PCQM4Mv2
Note: The motivation behind these rules is easier to understand after reading the description of the PCQM4Mv2 dataset: the goal is to use ML to accelerate expensive DFT calculations (which take up to a few hours per molecule!).
For an ML model to be practically useful, its inference must be fast enough. For PCQM4Mv2 only, we therefore limit the computational budget for test-time inference.
The specific rules are as follows:
- The total inference time over the ~147K test-dev/test-challenge molecules (i.e., the time to predict target values for the test molecules from their raw SMILES strings) should not exceed 4 hours, using a single GPU/TPU/IPU and a single CPU.*1 Note that multi-threading on a multi-core CPU is allowed. If you win the contest, you will need to provide the inference code (example here) that takes the ~147K test SMILES strings as input and saves the ~147K prediction values within 4 hours with a single GPU and CPU.
- You are allowed to use the following chemistry packages to process molecules from their SMILES strings: rdkit, Open Babel, and pyscf. The 4-hour budget must include the time to pre-process the test molecules with these packages, e.g., transforming test SMILES strings into graphs. This means that you cannot use expensive (quantum) calculations for feature engineering on your input test graphs, though you may include many cheap features in your graphs.
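To make the budget rule concrete, here is a minimal sketch of what a final inference script could look like: it takes a list of SMILES strings, predicts one value per molecule, and fails loudly if the 4-hour wall-clock budget is exceeded. The `predict_one` function is a hypothetical placeholder — in a real submission it would hold your rdkit-based featurization (SMILES to graph) and trained model, both of which count toward the budget.

```python
import time

BUDGET_SECONDS = 4 * 60 * 60  # 4-hour test-time inference budget

def predict_one(smiles: str) -> float:
    # Placeholder: a real submission would featurize the SMILES string
    # (e.g. rdkit SMILES -> graph) and run trained-model inference here.
    return float(len(smiles))  # dummy prediction

def run_inference(smiles_list):
    """Predict a target value for every test molecule, tracking wall time.

    Pre-processing time (SMILES -> graph) is counted inside the budget.
    """
    start = time.monotonic()
    preds = [predict_one(s) for s in smiles_list]
    elapsed = time.monotonic() - start
    if elapsed > BUDGET_SECONDS:
        raise RuntimeError(f"Inference took {elapsed:.0f}s, over the 4-hour budget")
    return preds

# Toy usage with three molecules in place of the ~147K test set:
preds = run_inference(["C", "CCO", "c1ccccc1"])
print(preds)  # one prediction per input SMILES string
```

The ~147K predictions would then be saved in the format expected by the test submission; only the loop and timing structure is intended as guidance here.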
For your reference, test inference for our baseline GNN takes about 1 minute (you can run the code here) on a single GeForce RTX 2080 GPU and an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz. Hence, the 4-hour budget is quite generous for an ordinary GNN model applied to our default molecular graphs. However, the budget could become the limiting factor if you apply expensive feature engineering to obtain your input test graphs. Note that from the quantum chemistry point of view, making predictions over the ~147K molecules in 4 hours (~0.1 second per molecule) is about four orders of magnitude faster than the original DFT calculations, making the ML-based approach practically fast and useful.
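The per-molecule figure and the speed-up claim above follow from simple arithmetic. The sketch below uses the document's own numbers; the DFT cost is assumed to be roughly one hour per molecule, a point within the stated "up to a few hours" range.

```python
import math

n_molecules = 147_000      # ~147K test-dev/test-challenge molecules
budget_s = 4 * 60 * 60     # 4-hour inference budget, in seconds

per_molecule_s = budget_s / n_molecules
print(f"{per_molecule_s:.3f} s per molecule")  # ~0.098 s, i.e. ~0.1 s

# DFT takes up to a few hours per molecule; assume ~1 hour here.
dft_s = 60 * 60
speedup = dft_s / per_molecule_s
print(f"~10^{math.log10(speedup):.1f}x speedup")  # about four orders of magnitude
```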
*1 Ideally, we would like participants to use a GPU/CPU with the same specs as ours (a GeForce RTX 2080 GPU and an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz). However, as it is hard to enforce this hardware constraint, we also allow other GPU/CPU specs (the 4-hour budget stays the same for simplicity). We will require you to report your hardware specs in the final test submission.
If you need any clarifications about the rules, please feel free to make posts at PCQM4Mv2’s discussion forum.
All information provided on the leaderboard submission page must be correct and follow the above OGB-LSC rules. A leaderboard submission cannot be deleted once it is public. If contacted by the OGB-LSC Team, you must provide information to verify the correctness of your submission. Otherwise, the submission may be deleted, and future submissions may be prohibited.