Is your neural network 'smart' or just big? This benchmark tells you the difference.
This Python package provides a framework for benchmarking neural network operations, inspired by the GWO (Generalized Windowed Operation) theory from the paper "Window is Everything: A Grammar for Neural Operations".
Instead of just measuring accuracy, this benchmark scores operations on their architectural efficiency. It quantifies the relationship between an operation's theoretical Operational Complexity (Ω_proxy) and its real-world performance, helping you design smarter, more efficient models.
The core idea is to break down any neural network operation (like Convolution or Self-Attention) into its fundamental building blocks and score its complexity.
- GWO (Generalized Windowed Operation): A "grammar" that describes any operation using three components:
  - Path (P): where to look for information (e.g., a local sliding window).
  - Shape (S): what form of information to look for (e.g., a square patch).
  - Weight (W): what to value in that information (e.g., a learnable kernel).
- Operational Complexity (Ω_proxy): The "intelligence score" of your operation. A lower score at the same performance means a more efficient design. It is calculated as `Ω_proxy = C_D + α * C_P`, where:
  - C_D (Descriptive Complexity): how many basic "primitives" it takes to describe your operation's structure. (You define this based on our guide.)
  - C_P (Parametric Complexity): how many extra parameters are needed to generate the operation's behavior dynamically (e.g., the offset prediction network in Deformable Convolution). This is calculated automatically.
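For example, a purely static operation pays no parametric cost: in the baseline table below, StandardConv has C_D = 6 and C_P = 0, so Ω_proxy = 6.00 regardless of α, while DeformableConv adds C_P = 0.003 for its offset-prediction network on top of C_D = 8.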
Install the released package with pip, or install in editable mode for development from this repository:
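A minimal sketch, assuming the package is published as `gwo-benchmark` and using a placeholder repository URL (adjust both to match the actual project):

```bash
# From PyPI (assumed package name):
pip install gwo-benchmark

# Or for development from this repository (placeholder URL):
git clone https://github.com/<your-org>/gwo-benchmark.git
cd gwo-benchmark
pip install -e .
```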
Let's benchmark a simple custom CNN on CIFAR-10.
Step 1: Define your model by inheriting from `GWOModule`
Create your model file my_models.py:
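A minimal sketch: the `GWOModule` base class and its two required members come from the API notes below, while the network itself and its constructor are illustrative.

```python
# my_models.py: a minimal sketch. GWOModule and its two required members
# (the C_D property and get_parametric_complexity_modules) come from the
# API notes below; the network architecture itself is illustrative.
import torch
import torch.nn as nn

from gwo_benchmark.base import GWOModule


class SimpleCNN(GWOModule):
    """A plain 3x3 convolutional network for CIFAR-10 (3x32x32 inputs)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                  # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),          # 16x16 -> 1x1
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

    @property
    def C_D(self) -> int:
        # Descriptive Complexity: the number of primitives needed to
        # describe the operation's structure (see the C_D guide below).
        # 6 matches the StandardConv baseline, since this model uses only
        # plain sliding-window convolutions.
        return 6

    def get_parametric_complexity_modules(self) -> list[nn.Module]:
        # No extra parameter-generating modules (e.g., an offset-prediction
        # network), so C_P = 0 for this static operation.
        return []
```

Setting `C_D = 6` mirrors the StandardConv baseline in the results table, and returning an empty module list gives `C_P = 0`.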
Step 2: Create your benchmark script
Create your main script run_benchmark.py:
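A hedged sketch: the built-in `Evaluator` appears in the API notes below, but the constructor arguments and `run()` call shown here are assumptions, so check the package docs for the exact signatures.

```python
# run_benchmark.py: a hedged sketch. The built-in Evaluator is mentioned
# in the API notes below, but the constructor arguments and run() call
# shown here are assumptions; check the package docs for exact signatures.
from gwo_benchmark.evaluator import Evaluator

from my_models import SimpleCNN

model = SimpleCNN(num_classes=10)

# Assumed arguments: a dataset identifier and the report output directory.
evaluator = Evaluator(dataset="cifar10", output_dir="benchmark_results")
evaluator.run(model)  # trains, tests, and reports Ω_proxy and the score
```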
Step 3: Run from your terminal
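From the project directory:

```bash
python run_benchmark.py
```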
You'll see a detailed analysis of your model's complexity and performance, saved in the benchmark_results directory.
A high score is good, but how high is high enough? To give your results context, we've established a tier system based on the performance of well-known baseline operations.
Check your rank and see all submissions on the official live leaderboard!
Your goal is to design an operation that reaches A-Tier or pushes the boundaries into S-Tier.
To provide clear context, we separate our official results into two categories: Baseline Operations (the basic building blocks) and Reference Architectures (the complete systems you aim to build).
This table shows the performance and complexity of a well-known, powerful architecture. This is not a direct competitor for efficiency scores, but rather the performance target you should aim for. Can you design new operations that allow you to build an architecture that achieves this level of accuracy with a lower total complexity?
| Architecture | Accuracy | Ω_proxy | C_D | C_P | Params (M) | Tier |
|---|---|---|---|---|---|---|
| ResNetGWO | 80.64% | 60.0 | 60 | 0.0 | 15.35 | Target |
This table lists the efficiency scores of individual operations. Your goal is to create new operations with a higher score than these classics, which you can then use to build more efficient architectures.
| Operation | Score | Accuracy (%) | Ω_proxy | C_D | C_P | Params (M) | Tier |
|---|---|---|---|---|---|---|---|
| StandardConv | 990.14 | 69.31 | 6.00 | 6 | 0.0 | 0.50 | B |
| DeformableConv | 771.40 | 69.45 | 8.00 | 8 | 0.003 | 1.63 | C |
| DepthwiseConv | 681.67 | 61.35 | 8.00 | 8 | 0.0 | 0.53 | C |
The tier system applies to individual operations, not full architectures.
- 🏆 S-Tier (State-of-the-Art): Score >= 1800. Reserved for breakthrough operations that set a new standard for efficiency. These designs significantly push the Pareto frontier.
- 🚀 A-Tier (Excellent): 1250 <= Score < 1800. Clearly outperforms the strong StandardConv baseline, indicating a highly competitive, well-designed, production-ready operation.
- ✅ B-Tier (Solid Baseline): 900 <= Score < 1250. A robust and competitive score. StandardConv (score ~990) is the key benchmark in this tier, making it the minimum target for a strong design.
- 💡 C-Tier (Promising): 500 <= Score < 900. A functional design with potential, but one that needs refinement to match the efficiency of the top baselines. Our DeformableConv (~771) and DepthwiseConv (~681) results fall into this category.
- 🔬 D-Tier (Experimental): Score < 500. An early-stage concept. Keep innovating!
Calculating C_D requires mapping your operation's logic to our official "primitive" vocabulary. For complex operations, a Large Language Model (LLM) like GPT-4, Claude, or Gemini can help you with this analysis.
Here is a ready-to-use prompt template. Simply replace the placeholder with your GWOModule code.
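For illustration, a minimal sketch of such a prompt (the official template and primitive vocabulary may differ):

```text
You are an expert in the GWO (Generalized Windowed Operation) grammar,
which decomposes a neural operation into Path (P), Shape (S), and
Weight (W) components.

Analyze the GWOModule implementation below. For each of P, S, and W,
list the primitives from the official vocabulary needed to describe
its structure, then report the total count as the Descriptive
Complexity C_D, with a one-line justification per primitive.

[PASTE YOUR GWOModule CODE HERE]
```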
The framework is designed for flexibility and extension.
- GWOModule (`gwo_benchmark.base.GWOModule`): The heart of your submission. You must inherit from this abstract class and implement:
  - `C_D` (property): your calculation of the Descriptive Complexity.
  - `get_parametric_complexity_modules()` (method): returns a list of `nn.Module`s that contribute to C_P.
- Evaluator (`gwo_benchmark.evaluator.BaseEvaluator`): Encapsulates all evaluation logic (training, testing, performance measurement).
  - Use the built-in Evaluator for standard datasets like CIFAR-10.
  - Create your own custom evaluation loop for specialized tasks by inheriting from `BaseEvaluator`.
- Datasets (`gwo_benchmark.datasets`): Easily add support for new datasets by inheriting from `BaseDataset` and registering your class, as sketched after this list. See the `datasets` directory for examples.
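As referenced in the Datasets item above, a hedged sketch of a custom dataset: `BaseDataset` is the confirmed base class, but the loader hooks shown here are assumptions, so mirror the real pattern in the `datasets` directory.

```python
# custom_datasets.py: a hedged sketch. BaseDataset is the confirmed base
# class, but the loader hooks shown here (train_loader/test_loader) and
# the registration step are assumptions; see the datasets directory for
# the real pattern.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from gwo_benchmark.datasets import BaseDataset


class FashionMNISTDataset(BaseDataset):
    """Hypothetical example wrapping torchvision's FashionMNIST."""

    def train_loader(self, batch_size: int = 128) -> DataLoader:  # assumed hook
        ds = datasets.FashionMNIST(
            "./data", train=True, download=True, transform=transforms.ToTensor()
        )
        return DataLoader(ds, batch_size=batch_size, shuffle=True)

    def test_loader(self, batch_size: int = 128) -> DataLoader:  # assumed hook
        ds = datasets.FashionMNIST(
            "./data", train=False, download=True, transform=transforms.ToTensor()
        )
        return DataLoader(ds, batch_size=batch_size, shuffle=False)
```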
We welcome contributions! This project is in its early stages, and we believe it can grow into a standard tool for the deep learning community.
- Add New GWO Models: Implement novel or existing operations (like Transformers, Attention variants, MLPs) as GWOModules in the examples directory.
- Support More Datasets: Help us expand the benchmark to new domains like NLP, Graphs, etc.
- Improve the Core Engine: Enhance the Evaluator, ComplexityCalculator, or add new analysis tools.
Please see our CONTRIBUTING.md for more details.
To ensure the integrity of the framework, please run tests before submitting a pull request.
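Assuming a standard `pytest` setup:

```bash
pytest
```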
If you use this framework in your research, please consider citing the original paper, "Window is Everything: A Grammar for Neural Operations".