Learning to Theorize the World from Observation
Korea Advanced Institute of Science and Technology
* Equal contribution
ICML 2026, Spotlight (Top 2.2% = 536/23918)

What does it mean to understand the world? Contemporary world models often operationalize understanding as accurate future prediction in latent or observation space, as in recent latent-dynamics and generative world-modeling work (Hafner et al., 2024; Bruce et al., 2024; Zhu et al., 2024). Developmental cognitive science, however, suggests a different view: human understanding emerges through the construction of internal theories of how the world works, even before mature language is acquired (Tenenbaum et al., 2011; Goddu and Gopnik, 2024; Dehaene et al., 2022). Inspired by this theory-building view of cognition, we introduce Learning-to-Theorize, a learning paradigm for inferring explicit explanatory theories of the world from raw, non-textual observations.
Background
Most contemporary AI systems, including recent world models, are primarily optimized for future prediction in latent or observation space, reconstruction quality, or task-specific performance (Hafner et al., 2018; Hafner et al., 2019; Hafner et al., 2020). These objectives do not require models to discover explicit, reusable mechanisms that explain how observations are generated and transformed. They can often be satisfied by learning entangled composite transformations that capture correlations among observed inputs and outputs.
This gap motivates the central question of the paper: Can an artificial system learn to construct explicit explanatory theories of the world merely by observing raw, non-textual sensory inputs? Addressing this question requires a shift in learning objectives: from fitting input-output mappings to discovering structured, compositional mechanisms that explain how observations are generated and transformed.
Learning to Theorize
Problem Definition
The learner observes only independent pairs of raw observations \((x,y)\). It does not receive program annotations, primitive labels, intermediate states, language descriptions, or task groups that reveal which examples share the same underlying mechanism, unlike many program-supervised or grouped-task settings (Mao et al., 2019; Nye et al., 2020; Chollet, 2019). The only available evidence is that some hidden transformation turns \(x\) into \(y\).
The learning problem is therefore not merely to reconstruct \(y\) from \(x\), but to recover reusable structure from ungrouped before-and-after evidence. We define the ability to theorize the world as the capacity to (i) discover reusable abstract primitives across phenomena, (ii) learn how to compose them into structured explanations of complex observations, and (iii) explain novel phenomena by forming new compositions of the same primitives. While theories may be instantiated in many concrete forms, such as natural language, probabilistic programs, or symbolic programs, we formulate Learning-to-Theorize (L2T) as latent neural program induction from observation. In this formulation, programs act as executable representations of theories, in the spirit of Language-of-Thought and program-learning accounts of abstraction (Fodor, 1975; Ellis et al., 2020); accordingly, we use the terms theory and program interchangeably.
Phenomenon and Generative Process.
A phenomenon is a pair of observations \((x, y)\), where \(x \sim p(x)\) is a source observation and \(y\) is the corresponding target observation. We assume that each phenomenon is generated by an underlying but unobserved program, or causal mechanism, \(\tau\), that transforms \(x\) into \(y\).
A program is a compositional object formed by combining a finite set of primitive operations. Let \(\mathcal{Z}=\lbrace z_1,z_2,\dots,z_M\rbrace\) denote a set of primitive operations, where each primitive \(z_i\) is associated with an execution function \(f_{z_i}:\mathcal{X}\rightarrow\mathcal{X}\). A program of length \(K\) is defined as an ordered sequence
\[\tau=(z_{i_1},z_{i_2},\dots,z_{i_K}),\qquad z_{i_k}\in\mathcal{Z}.\]
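Program execution corresponds to functional composition, with the first primitive in the sequence applied first:
\[f_\tau=f_{z_{i_K}}\circ\cdots\circ f_{z_{i_2}}\circ f_{z_{i_1}}.\]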
Given a source observation \(x\), the target observation is obtained by executing the program, either deterministically as \(y=f_\tau(x)\) or more generally through the conditional distribution \(p(y\mid x,\tau)\). This formulation makes theories compositional and reusable: the same primitive functions can be shared across many phenomena, while new theories arise from new sequences of those primitives.
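The composition of primitives into a program can be made concrete with a small sketch. The primitive names `add1` and `double` and the integer observation space are hypothetical choices for illustration only; they are not primitives from the paper, and the learned execution functions in NEO are neural rather than hand-written.

```python
from functools import reduce
from typing import Callable, Dict, Sequence

# Hypothetical primitive set Z: each name z maps to an execution function f_z: X -> X.
# Here X is the integers, chosen only to keep the example small.
PRIMITIVES: Dict[str, Callable[[int], int]] = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
}


def execute(program: Sequence[str], x: int) -> int:
    """Run a program by applying its primitives in order: f_{z_K}(... f_{z_1}(x))."""
    return reduce(lambda state, z: PRIMITIVES[z](state), program, x)


# The same primitives yield different theories when composed in a different order.
print(execute(["add1", "double"], 3))  # (3 + 1) * 2 = 8
print(execute(["double", "add1"], 3))  # (3 * 2) + 1 = 7
print(execute([], 3))                  # the empty program leaves x unchanged: 3
```

Because the two primitives do not commute, the two length-2 programs are distinct theories built from one shared primitive set, which is the sense in which new theories arise from new sequences.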
The space of all finite-length primitive sequences \(\mathcal{Z}^{*}\) grows exponentially with program length, so only a small subset of possible programs can appear during training. We therefore assume that the training pairs \((x_n,y_n)\), \(n=1,\dots,N\), are generated by programs \(\tau_n\) drawn from a restricted subset of \(\mathcal{Z}^{*}\).
The programs \(\lbrace\tau_n\rbrace\) and their associated execution functions are latent and never observed. Consequently, learning seeks to jointly infer a theory for each phenomenon and to learn a shared set of execution functions that realize these theories. This assumption is intentionally general: in a world-modeling setting, \(x_n\) and \(y_n\) may simply be temporally separated observations, without knowing the time lag, task identity, or mechanism that connects them.
Transferability as Evidence of Theorization
The main evaluation criterion is program transferability. At test time, phenomena are generated by programs drawn from a set that is disjoint from the set of training programs.
The test set can include both new compositions of known primitives and programs longer than those realized during training. Thus, evaluation requires not only compositional generalization but also length generalization, or productivity (Lake and Baroni, 2018; Lake and Baroni, 2023).
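To illustrate how the program space grows and how disjoint program sets can be formed, the following sketch enumerates programs over four primitive names and builds three splits. The split rule (holding out every fourth program) and the maximum training length of 3 are assumptions made for this example, not the benchmark's actual construction.

```python
from itertools import product

# Four illustrative primitive names; any finite primitive set Z would do.
PRIMITIVES = ("up", "right", "down", "left")
MAX_TRAIN_LEN = 3  # hypothetical longest program length seen in training


def programs_of_length(k):
    """All |Z|^k ordered sequences of k primitives."""
    return list(product(PRIMITIVES, repeat=k))


# The number of programs grows as |Z|^K.
sizes = {k: len(programs_of_length(k)) for k in range(1, 5)}
print(sizes)  # {1: 4, 2: 16, 3: 64, 4: 256}

# All programs within the training length range.
in_range = [p for k in range(1, MAX_TRAIN_LEN + 1) for p in programs_of_length(k)]

# Hypothetical deterministic split: every fourth program is held out.
train_programs = {p for i, p in enumerate(in_range) if i % 4 != 0}
comp_ood_programs = {p for i, p in enumerate(in_range) if i % 4 == 0}

# Length OOD: programs strictly longer than anything in training.
length_ood_programs = set(programs_of_length(MAX_TRAIN_LEN + 1))

print(len(train_programs), len(comp_ood_programs), len(length_ood_programs))
```

Both held-out sets are disjoint from the training programs by construction; the compositional set stays within the training length range, while the length set lies beyond it.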
We consider two phenomena \((x^{(1)},y^{(1)})\) and \((x^{(2)},y^{(2)})\) generated by the same latent program \(\tau\). The model first infers a program \(\hat{\tau}\) from the support pair \((x^{(1)},y^{(1)})\). The inferred program is then executed on the new source \(x^{(2)}\):
\[\hat{y}^{(2)}=f_{\hat{\tau}}(x^{(2)}).\]
Performance is measured by the observation-space error \(d_{obs}(\hat{y}^{(2)},y^{(2)})\). This protocol assesses whether the learned theory captures a transferable generative mechanism, rather than merely fitting an individual input-output pair.
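The infer-then-execute protocol can be sketched end to end on a toy problem. Everything here is an illustrative assumption: the integer primitives, the absolute-difference `d_obs`, and in particular `infer_program`, which uses brute-force shortest-program search as a stand-in for NEO's learned inference.

```python
from itertools import product

# Hypothetical primitives over integers (not the paper's).
PRIMITIVES = {
    "add1": lambda v: v + 1,
    "double": lambda v: v * 2,
    "neg": lambda v: -v,
}


def execute(program, x):
    """Apply the primitives of `program` in order."""
    for z in program:
        x = PRIMITIVES[z](x)
    return x


def infer_program(x, y, max_len=3):
    """Stand-in inference: return the shortest program mapping x to y, or None."""
    for k in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=k):
            if execute(program, x) == y:
                return program
    return None


def d_obs(a, b):
    """Observation-space error; absolute difference is an assumption of this sketch."""
    return abs(a - b)


# Two phenomena generated by the same hidden program.
true_program = ("double", "add1")
x1, x2 = 3, 10
y1, y2 = execute(true_program, x1), execute(true_program, x2)

tau_hat = infer_program(x1, y1)   # infer from the support pair only
y2_hat = execute(tau_hat, x2)     # execute on the new source
print(tau_hat, d_obs(y2_hat, y2))
```

A single support pair can be ambiguous: with `x1 = 1`, both `("add1",)` and `("double",)` map 1 to 2, yet they disagree on other inputs. Low support error therefore does not by itself guarantee low transfer error.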
Neural Theorizer

The Neural Theorizer (NEO) is trained to maximize the conditional likelihood \(p_\theta(y \mid x)\). To explicitly model theory construction, it introduces two latent variables: a program \(\tau=(z_{i_1},\dots,z_{i_K})\) and its execution trace \(s=(s_1,\dots,s_{K+1})\). Under a Markov assumption, the conditional distribution is obtained by marginalizing over programs and execution traces.
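The prior over programs and execution traces factorizes as
\[p_\theta(\tau,s\mid x)=p_\theta(s_1\mid x)\prod_{k=1}^{K}p_\theta(z_{i_k}\mid s_k)\,p_\theta(s_{k+1}\mid s_k,z_{i_k}).\]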
Here, \(p_\theta(s_1 \mid x)\) maps observations to latent states, \(p_\theta(z_{i_k}\mid s_k)\) defines a theory programmer that selects primitive operations, and \(p_\theta(s_{k+1}\mid s_k,z_{i_k})\) defines a shared transition operator implementing primitive execution.
Since exact marginalization is intractable, NEO introduces a variational posterior over programs and execution traces, following standard variational-inference machinery for latent-variable models (Jordan et al., 1999; Kingma and Welling, 2013).
The theory programmer \(q_\phi(z_{i_k}\mid s_k,y)\) is a goal-conditioned policy over primitive operations. Given the current latent state \(s_k\) and target observation \(y\), it selects the next primitive to steer the execution trace toward a latent state that explains \(y\), thereby inducing a compositional program without explicit program supervision.
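The idea of a goal-conditioned policy that picks one primitive at a time can be illustrated with a hand-written rule on a toy multiplicative domain. This is only a sketch: `programmer` is a deterministic heuristic standing in for the learned \(q_\phi(z_{i_k}\mid s_k,y)\), the latent state is taken to be the observation itself, and inputs are assumed to be positive integers.

```python
PRIMES = (2, 3, 5, 7)  # primitives: "multiply by p"


def programmer(state, target):
    """Stand-in for q_phi(z | s_k, y): pick a prime that keeps the state a
    divisor of the target, or return None when no primitive helps (stop)."""
    for p in PRIMES:
        if target % (state * p) == 0:
            return p
    return None


def induce_program(x, y, max_steps=8):
    """Roll out the policy from x toward y; return the program and its trace."""
    state, program, trace = x, [], [x]
    for _ in range(max_steps):
        z = programmer(state, y)
        if z is None:
            break
        state = state * z  # transition s_{k+1} = f_z(s_k)
        program.append(z)
        trace.append(state)
    return program, trace


program, trace = induce_program(3, 60)
print(program)  # [2, 2, 5]
print(trace)    # [3, 6, 12, 60]
```

The program is never given as a label: it is induced step by step because each choice is conditioned on both the current state and the target.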
Minimum Description Length
Assuming a fixed program length is unrealistic: simple phenomena should not be forced into unnecessarily long explanations. NEO therefore uses the Minimum Description Length principle (Grünwald, 2007) and favors explanations that are both accurate and short. For each intermediate step \(k\), the model decodes a reconstruction \(\hat{y}_k=D_\theta(s_k)\) and selects the explanation length \(k^*\) that minimizes a length-penalized reconstruction error, where \(\lambda_{\text{MDL}}>1\) penalizes longer programs. The model is then updated using the prediction at the selected explanation length \(k^*\). This pressure encourages the model to explain each phenomenon using the shortest accurate program.
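One possible form of such a length-penalized selection is sketched below. The multiplicative penalty `error * lam_mdl ** k` is an assumption of this example and should not be read as the paper's stated rule; it is chosen only because it makes \(\lambda_{\text{MDL}}=1\) correspond to no length penalty. Here `errors[k]` is the reconstruction error after executing `k` primitives.

```python
def select_length(errors, lam_mdl):
    """Return the index k* minimizing errors[k] * lam_mdl**k.

    Assumed scoring rule (illustrative only). Ties go to the shorter program
    because min() returns the first minimal index.
    """
    if lam_mdl < 1:
        raise ValueError("lam_mdl is expected to be at least 1")
    scores = [err * lam_mdl ** k for k, err in enumerate(errors)]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-step reconstruction errors for one phenomenon.
errors = [0.9, 0.5, 0.1, 0.09]

print(select_length(errors, lam_mdl=1.0))  # 3: no penalty, lowest raw error wins
print(select_length(errors, lam_mdl=1.2))  # 2: the tiny gain at step 3 is not worth the extra primitive
```

With the penalty active, the marginal improvement from a fourth state no longer justifies a longer explanation, so the shorter program is selected.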
NEO also grounds intermediate states to the observation manifold through a decode-encode consistency term, which penalizes intermediate states that change when they are decoded to an observation and encoded back. This prevents intermediate states from drifting into arbitrary latent regions that may reconstruct the final target but fail to support reusable primitive execution.
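A decode-encode consistency penalty can be sketched with a toy encoder and decoder. The specific choices below are assumptions for illustration: observations are integer grid cells, `decode` snaps a latent state to the nearest cell, `encode` maps a cell back to a latent vector, and the penalty is a squared Euclidean distance. NEO's encoder and decoder are learned networks, and the exact loss may differ.

```python
def decode(state):
    """Toy decoder D: snap a 2-D latent state to the nearest integer grid cell."""
    return tuple(round(v) for v in state)


def encode(observation):
    """Toy encoder E: embed a grid cell as a float vector."""
    return tuple(float(v) for v in observation)


def consistency_loss(state):
    """Squared distance between a state and its decode-encode round trip."""
    round_trip = encode(decode(state))
    return sum((a - b) ** 2 for a, b in zip(state, round_trip))


on_manifold = (2.0, 4.0)    # already corresponds to a valid observation
off_manifold = (2.4, 3.7)   # drifts between valid observations

print(consistency_loss(on_manifold))   # 0.0
print(consistency_loss(off_manifold))  # approximately 0.25
```

States that correspond to real observations incur no penalty, while states that sit between valid observations are pulled back toward the manifold.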
Experiments
We introduce the Observation-to-Theory Induction Benchmark (OTIB) to evaluate whether a model can infer reusable primitives from raw observation pairs without supervision. Its central criterion is transferable explanation: a theory induced from one transition should generalize to new inputs, rather than memorizing instance-specific mappings.
OTIB separates three regimes, following the spirit of systematic-generalization benchmarks for world models (Kim et al., 2023). The in-distribution test set uses held-out examples from the training program support. The compositional OOD set holds out program compositions within the length range seen during training. The length OOD set uses longer programs than those seen during training. This separation is important because a model can appear successful on in-distribution reconstruction while failing to compose primitives in a new way.
Each evaluation instance consists of a support pair \((x^{(1)},y^{(1)})\) and a query pair \((x^{(2)},y^{(2)})\) generated by the same latent program. Given the support pair, a model induces a theory \(\hat{\tau}\). The induced theory is then executed on both \(x^{(1)}\) and \(x^{(2)}\), producing \(\hat{y}^{(1)}\) and \(\hat{y}^{(2)}\). Self-explainability measures whether the model explains the original support pair. Transferability measures whether the same inferred theory generalizes to the query input, rather than encoding instance-specific information about \(y^{(1)}\).
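The difference between the two metrics can be seen by scoring two hand-built "theories" on a toy grid. This is a sketch under stated assumptions: the coordinate convention, wall clamping, Manhattan-distance `d_obs`, and the `memorizer` baseline are all hypothetical and only illustrate why a low support error does not imply transfer.

```python
from typing import Callable, Dict, Sequence, Tuple

Pos = Tuple[int, int]
GRID = 10  # toy 10x10 grid
MOVES: Dict[str, Pos] = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}


def execute(program: Sequence[str], pos: Pos) -> Pos:
    """Apply moves in order, clamping at the walls (an assumption of this sketch)."""
    col, row = pos
    for z in program:
        d_col, d_row = MOVES[z]
        col = min(max(col + d_col, 0), GRID - 1)
        row = min(max(row + d_row, 0), GRID - 1)
    return (col, row)


def d_obs(a: Pos, b: Pos) -> int:
    """Manhattan distance as a stand-in observation-space error."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def evaluate(theory: Callable[[Pos], Pos], support, query):
    """Score one theory on the support pair and on the query pair."""
    (x1, y1), (x2, y2) = support, query
    return {"self_explain_error": d_obs(theory(x1), y1),
            "transfer_error": d_obs(theory(x2), y2)}


true_program = ("right", "right", "down")
x1, x2 = (1, 1), (5, 4)
support = (x1, execute(true_program, x1))  # target (3, 2)
query = (x2, execute(true_program, x2))    # target (7, 5)

compositional = lambda x: execute(true_program, x)  # a reusable program
memorizer = lambda x: support[1]                    # encodes y1 itself

print(evaluate(compositional, support, query))  # both errors 0
print(evaluate(memorizer, support, query))      # self-explain 0, transfer 7
```

Both theories explain the support pair perfectly; only the compositional one carries over to the query input.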
We instantiate OTIB in three domains:
| Domain | Hidden primitives | Generalization |
|---|---|---|
| GridWorld | Up, right, down, left moves on a 10×10 grid | Comp. OOD + length OOD up to 8 steps |
| Arithmetic Factorization | Multiplication by \(2,3,5,7\) | Comp. OOD + length OOD up to 6 steps |
| Image Editing | Brightness ±, hue ±, horizontal/vertical flip, rotation, mask | Comp. OOD + length OOD up to 4 steps |
The parameter \(\alpha\in\lbrace 0.33,0.66,1.00\rbrace\) controls the fraction of short programs included in training. Smaller \(\alpha\) creates a harder setting where some primitives are never observed in isolation and must be discovered by decomposing entangled multi-step transitions. By construction, the training compositions contain enough evidence in principle to recover the full primitive set; the benchmark asks whether a model can actually perform that decomposition.
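The notion of a primitive that is "never observed in isolation" can be checked mechanically for any set of training programs. The training set below is a made-up example, not an OTIB configuration, and the helper names are hypothetical.

```python
def seen_in_isolation(train_programs):
    """Primitives that appear as a length-1 program in training."""
    return {p[0] for p in train_programs if len(p) == 1}


def seen_anywhere(train_programs):
    """Primitives that appear in any training program."""
    return {z for p in train_programs for z in p}


# Hypothetical low-alpha training set: few short programs are included.
train_programs = [
    ("up",),
    ("right",),
    ("up", "down"),
    ("left", "down", "right"),
]

isolated = seen_in_isolation(train_programs)
hidden = seen_anywhere(train_programs) - isolated

print(sorted(isolated))  # ['right', 'up']
print(sorted(hidden))    # ['down', 'left']
```

In this example `down` and `left` only ever occur inside multi-step transitions, so a learner can recover them only by decomposing those transitions.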
Results
The experiments ask whether NEO can (1) discover latent primitive operations that are never directly observed during training and (2) explain dynamics arising from previously unseen program compositions. The main failure mode to watch for is the gap between self-explainability and transferability. Monolithic latent baselines, motivated by recent latent-action models and latent program search (Bruce et al., 2024; Schmidt and Jiang, 2024; Gao et al., 2025; Macfarlane and Bonnet, 2025), can often reconstruct the support target, but the same latent action does not transfer to a query input, indicating that it encodes instance-specific information rather than a reusable theory.


On GridWorld, monolithic baselines transfer well in-distribution but largely collapse on compositional and length OOD splits. This suggests that a single latent vector can capture familiar transformations without inducing the primitive structure required for systematic transfer. In contrast, NEO maintains strong OOD transfer, and NEO-S further improves transferability by sampling candidate theories at test time.
Arithmetic Factorization is a stricter test because successful transfer requires exact multi-step symbolic execution. NEO still outperforms monolithic baselines on compositional OOD, showing that it can acquire reusable multiplicative primitives. Length OOD remains harder for the learned policy alone, but NEO-S substantially improves performance by searching over diverse compositions of the same learned primitive set.

The same pattern appears in the high-dimensional visual setting. Image Editing requires the model to infer transformations such as brightness adjustment, hue shift, flip, rotation, and masking from pixels. NEO achieves lower reconstruction error than the monolithic baselines across compositional and length OOD regimes, supporting the interpretation that it explains target images through executable primitive edits rather than a single monolithic program vector.


Analysis
A central piece of evidence for L2T is whether the learned codes recover primitive structure rather than memorizing observed composite transformations. In low-\(\alpha\) settings, some primitives are never observed in isolation. A model that simply memorizes observed transformations can only learn entangled composite actions. NEO instead recovers primitive-level codes and composes them into multi-step programs, suggesting that it learns reusable operations rather than training-set-specific shortcuts.
The primitiveness analysis measures whether learned codes cover the ground-truth primitive set. The ground-truth (GT) reference bar reflects only the primitives directly observable in training, so matching that bar is not enough: a model that memorizes visible transformations can appear competent without discovering hidden primitives. NEO exceeds this directly observable set and often approaches full primitive coverage, indicating that it decomposes multi-step transitions into reusable actions.


Sampling-based test-time scaling provides another view of the same structure. Because NEO represents explanations as explicit programs, inference can explore multiple candidate compositions without changing the learned executor. Increasing the sampling budget improves the chance of finding a correct execution path, especially in domains where the primitive set is learned but long-horizon program selection remains difficult.
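Best-of-N sampling over explicit programs can be sketched as follows. This is an illustration under assumptions: candidates are drawn uniformly at random rather than from a learned policy, the integer primitives are hypothetical, and candidates are ranked by support error only.

```python
import random

# Hypothetical primitives over integers (not the paper's).
PRIMITIVES = {
    "add1": lambda v: v + 1,
    "double": lambda v: v * 2,
    "neg": lambda v: -v,
}
NAMES = tuple(PRIMITIVES)


def execute(program, x):
    """Apply the primitives of `program` in order with a fixed executor."""
    for z in program:
        x = PRIMITIVES[z](x)
    return x


def best_of_n(x, y, budget, seed=0, max_len=4):
    """Sample `budget` candidate programs and keep the lowest-error one."""
    rng = random.Random(seed)
    best_program, best_error = (), abs(x - y)  # start from the empty program
    for _ in range(budget):
        length = rng.randint(1, max_len)
        candidate = tuple(rng.choice(NAMES) for _ in range(length))
        error = abs(execute(candidate, x) - y)
        if error < best_error:
            best_program, best_error = candidate, error
    return best_program, best_error


x, y = 3, 15  # e.g. generated by ("double", "add1", "double", "add1")
for budget in (1, 8, 64, 512):
    program, error = best_of_n(x, y, budget)
    print(budget, program, error)
```

With a fixed seed, a larger budget evaluates a superset of the candidates seen by a smaller one, so the best error can only stay the same or decrease. The executor is untouched; only the amount of search changes.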
Takeaway
Taken together, the results support the central claim of Learning-to-Theorize: structured, executable theories can be learned from raw observation pairs without program supervision. NEO’s advantage does not come merely from better reconstruction. It comes from representing explanations as compositional programs, learning primitives that can be reused across phenomena, and regulating explanation length through the MDL principle.
This proof of concept points toward world models that move beyond prediction-centric learning. Instead of only forecasting observations, such models can infer compact mechanisms that explain what changed and can execute those mechanisms in new contexts.
Limitations
This work should be viewed as an initial proof of concept for Learning-to-Theorize. The current formulation assumes a relatively small, discrete set of primitives and short program lengths, which limits scalability to domains with long-horizon, continuous, or highly structured dynamics. Primitive semantics are induced through reconstruction and are therefore not guaranteed to align with human-interpretable concepts or truly causal factors, a broader challenge for causal representation learning (Schölkopf et al., 2021). The inference procedure also relies on deterministic execution and reconstruction-based stopping criteria, which may be brittle under noise, ambiguity, or partial observability. The experiments are also restricted to controlled synthetic benchmarks. Extending L2T to richer real-world environments with complex perceptual inputs, stochastic dynamics, and open-ended theory spaces remains an important direction for future work.
BibTeX
@inproceedings{baek2026learning,
  title     = {Learning to Theorize the World from Observation},
  author    = {Baek, Doojin and Lee, Gyubin and Baek, Junyeob and Lee, Hosung and Ahn, Sungjin},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}