The Master Algorithm of AI, Finally? On LLM-Empowered Heuristic Learning
TL;DR
- Pedro Domingos’ The Master Algorithm (2015) posed the central question of ML in suitably grand terms: find one learner that unifies the five tribes, learns from data without per-domain engineering, and acquires knowledge of arbitrary complexity. Ten years on, nobody has produced it, and people seem to have stopped looking.
- Four independent works published in the last few months — Jiayi Weng’s “Learning Beyond Gradients” (2026), Jun Wang’s Memento (2025), Memento 2 (2025), and Memento-Skills (2026) series, and Andrej Karpathy’s LLM-wiki (2026b) and AutoResearch (2026a) — have, with no apparent coordination, converged on the same paradigm. I shall call it LLM-Empowered Heuristic Learning (HL): a closed loop in which a frozen LLM edits an external, human-readable artefact (rules, skills, wiki, code) rather than its own weights.
- Scored against Domingos’ three criteria, HL is the strongest candidate for the master algorithm we have ever had. It plausibly satisfies all three. The honest reasons to hesitate are not the obvious ones — Turing-completeness handles arbitrary complexity perfectly well, thank you — but harder questions about whether HL can discover genuinely new knowledge, whether the compute economics survive contact with reality, and whether a “system” architecture is the kind of master algorithm Domingos was looking for in the first place.
Domingos’ Question, Restated
Before assessing whether anything qualifies as the master algorithm, it helps to be precise about what Domingos was asking. In The Master Algorithm, the criteria are not delivered as a single numbered list — that would have been too tidy — but they are clear enough across the book and his subsequent talks:
- Unification. A master algorithm must be capable of doing what each of the five tribes’ algorithms can do — symbolic rule learning, neural-network function approximation, evolutionary search, probabilistic inference, and analogical reasoning. The book’s narrative arc is precisely the search for an architecture that combines them.
- Learning from data without per-domain engineering. The master algorithm should be a general learner. Domingos is explicit that hand-engineered, domain-specific systems do not count, no matter how impressive — the lesson he draws from the 1980s expert-systems era, when the rules worked beautifully provided you employed enough PhDs to keep writing them.
- Knowledge of arbitrary complexity. The learner must be able to represent and acquire knowledge with no in-principle ceiling. Domingos invokes universal Turing machines and uses the language of “all the knowledge in the world,” which one suspects he means more or less literally.
These three criteria are the spine of this article. I shall resist the temptation to add a fourth (efficiency, interpretability, alignment, what have you); Domingos did not, and HL deserves the same playing field he gave his preferred candidate, Markov Logic Networks (MLNs).
A note before going further. Domingos has been publicly sceptical of LLMs — calling them inefficient, hallucination-prone, and “not truly intelligent” — and his “Tensor Logic” work is his own attempt at a unifier. I have no evidence that he has commented on Heuristic Learning specifically. The provocation in this article’s title is therefore mine, not his; I suspect he would, if pressed, ask me to step outside. What follows is an argument that HL has a stronger claim to satisfying his three criteria than any candidate he himself has proposed — and that he may be wrong to withhold the title.
What HL Actually Is
The Weng blog: an anomaly that became a paradigm
Jiayi Weng’s post opens with what he calls “the anomaly.” While maintaining EnvPool in his spare time, he wanted a cheap way to test that game environments behaved correctly without burning compute on neural-network rollouts. He asked Codex (gpt-5.4) to write a rule-based Breakout policy. The score progression — 387 → 507 → 839 → 864 — has by now become a small Internet artefact, and the appendix walks through each jump with verbatim code, RAM-byte probes, video replays, and trial logs. What grabbed me was not the score. It was the structure of what Codex produced: not a policy.py of any recognisable kind, but a self-maintaining experimental system with action probes, state readers, ball-landing predictors, stuck-loop detectors, regression tests, replay videos, and trials.jsonl. Codex did not just write a policy; it built a small laboratory and then began running experiments in it.
From this, Weng abstracts to a clean definition:
“HL is built out of program code… the object being updated is software structure rather than neural-network parameters… Its updates do not use backpropagation. The coding agent directly edits policies, state detectors, tests, configuration, or memory.”
He names the maintained artefact a Heuristic System (HS). The crucial framing — and where Weng is genuinely original — is the maintenance curve. Expert systems and rule bases failed in the 1980s not because rules are useless, but because the people maintaining them turned out to be expensive, mortal, and prone to leaving for industry. Spinning machines changed the production curve for textiles; coding agents change the maintenance curve for heuristics. That single analogy reframes about forty years of AI history.
The table at the heart of the post compares Deep RL to HL across six axes (policy, state, action, feedback, update, memory) and lists HL’s properties: explainability, sample efficiency, regression-testability, constrained overfitting via multi-seed evaluation, and partial immunity to catastrophic forgetting. Weng is honest about the limits. HL “cannot do everything neural networks can do.” On Atari games like Atlantis, VideoPinball, UpNDown, and StarGunner, PPO still smokes the coding agent. Montezuma’s Revenge required 86 hand-stitched macro-actions for a single 400-point run — which is to say, not so much a learned policy as a rather long film script.
Memento, Memento 2, Memento-Skills: the same idea, formalised
While Weng was poking Codex in his spare time, Jun Wang’s group at UCL was doing the load-bearing theoretical work. The lineage:
Memento (AgentFly) (Zhou et al. 2025). The empirical paper. Memory-augmented Markov Decision Process (M-MDP), neural case-selection policy, frozen LLM. Top-1 on GAIA validation at 87.88% Pass@3, 79.40% on the test set, 66.6% F1 / 80.4% PM on DeepResearcher. Case-based memory adds 4.7–9.6 absolute points on out-of-distribution tasks.
Memento 2: Learning by Stateful Reflective Memory (Wang 2025). The theory paper, single-authored. Wang defines a Stateful Reflective Decision Process (SRDP) as ⟨S, A, P, R, γ, M, p_LLM⟩, with the composite reflective policy:
π^μ(a | s, M_t) = Σ_{c ∈ M_t} μ(c | s, M_t) · p_LLM(a | s, c)
Read = policy improvement (KL-regularised soft Bellman backup); Write = policy evaluation (memory rewriting). The paper proves convergence to an optimal retrieval policy μ* under bounded rewards and γ<1, monotonic policy improvement on the fast time scale, and an asymptotic value gap bounded by
[2 R_max / (1−γ)²] · (ε_LLM(r_M) + δ_M), where r_M is the memory coverage radius, ε_LLM(r_M) is the LLM generalisation error at that radius, and δ_M is the retrieval error. Wang’s central thesis: “We argue that the key mechanism for continual adaptation, without updating model parameters, is reflection: the agent’s ability to use past experience to guide future actions.”
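The composite policy is easy to see numerically. Below is a toy sketch of the mixture, with made-up retrieval weights standing in for μ and invented per-case action distributions standing in for p_LLM; none of it is the paper’s implementation.

```python
def composite_policy(mu, per_case_dists):
    """pi^mu(a|s,M) = sum over retrieved cases c of mu(c|s,M) * p_LLM(a|s,c)."""
    actions = per_case_dists[0].keys()
    return {a: sum(w * d[a] for w, d in zip(mu, per_case_dists))
            for a in actions}

# Two retrieved cases; the weights and distributions are invented numbers.
mu = [0.7, 0.3]                      # retrieval policy mu(c | s, M)
per_case_dists = [
    {"left": 0.8, "right": 0.2},     # p_LLM(a | s, case 1)
    {"left": 0.1, "right": 0.9},     # p_LLM(a | s, case 2)
]

pi = composite_policy(mu, per_case_dists)
# pi["left"] = 0.7*0.8 + 0.3*0.1 = 0.59, and the mixture still sums to 1.
```

The point of the construction is visible even at toy scale: the Read step (choosing μ) and the frozen LLM’s conditional are separate objects, so the retrieval policy can be improved without touching the model.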
Memento-Skills: Let Agents Design Agents (Zhou et al. 2026). The instantiation. Skills are stored as structured markdown folders containing a SKILL.md declarative spec plus prompts and executable code. A behaviour-trainable router (a tuned Qwen3-Embedding-0.6B called “Memento-Qwen”) selects skills; the frozen LLM executes; failures trigger failure-attribution, skill rewrite, or skill discovery, with a unit-test gate to prevent regression. The agent starts from 5 atomic skills and grows to 41 after GAIA learning and 235 after HLE learning. GAIA training 65.1% → 91.6% over three reflective rounds; HLE 30.8% → 54.5%. On the GAIA test set, Memento-Skills scores 66.0% against the Read-Write ablation’s 52.3% — a +13.7-point gap that quantifies the value of skill-level reflection over plain case-based reasoning.
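As a sketch of the mechanics — not the paper’s code — here is what an embedding router plus a unit-test gate looks like in miniature. The skills, vectors, and tests are all hypothetical stand-ins for Memento-Qwen’s actual machinery.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def route(task_vec, skills):
    """Pick the skill whose (toy) embedding is nearest the task embedding."""
    return max(skills, key=lambda s: cosine(task_vec, s["embedding"]))

def gated_update(skill, new_impl, unit_tests):
    """Accept a rewritten skill only if every regression test still passes."""
    if all(test(new_impl) for test in unit_tests):
        skill["impl"] = new_impl
        return True
    return False

skills = [
    {"name": "parse_table", "embedding": [1.0, 0.1], "impl": lambda x: x},
    {"name": "web_search",  "embedding": [0.1, 1.0], "impl": lambda x: x},
]
chosen = route([0.9, 0.2], skills)   # nearest neighbour: parse_table
accepted = gated_update(chosen, lambda x: x * 2, [lambda f: f(2) == 4])
rejected = gated_update(chosen, lambda x: x + 1, [lambda f: f(2) == 4])
```

The gate is the part doing the regression-prevention work: a rewrite that fails any stored test is simply never installed, which is the skill-level analogue of Weng’s regression tests.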
What makes the Memento–Weng pairing entertaining is that neither side has noticed the other. Wang’s papers and Weng’s blog post are months apart; the venues could not be more different (formal mathematical theory on one side, an engineer’s blog with mp4 replays on the other); and yet they converge on the same five-step loop: Observe → Read → Act → Feedback → Write. They differ in representation — Memento-Skills uses markdown skill folders manipulated through retrieval; Weng uses a single growing Python codebase manipulated directly by Codex — but they are unmistakably siblings, possibly twins separated at birth.
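The shared loop is concrete enough to write down. What follows is a deliberately toy sketch of Observe → Read → Act → Feedback → Write, with a dict standing in for the heuristic artefact and a coin-flip proposer standing in for the coding agent; every name in it is mine, not Weng’s or Wang’s.

```python
import itertools
import random

def hl_loop(observe, evaluate, propose, memory, steps):
    """One pass of Observe -> Read -> Act -> Feedback -> Write per step."""
    rewards = []
    for _ in range(steps):
        obs = observe()                                          # Observe
        cached = memory.get(obs)                                 # Read
        action = cached if cached is not None else propose(obs)  # Act
        r = evaluate(obs, action)                                # Feedback
        if r > 0:
            memory[obs] = action                                 # Write
        rewards.append(r)
    return rewards

# Toy task: the right action for observation n is n * 2. The proposer
# only guesses correctly half the time, but the artefact locks in wins.
random.seed(0)
obs_stream = itertools.cycle([1, 2, 3])
memory = {}
rewards = hl_loop(
    observe=lambda: next(obs_stream),
    evaluate=lambda o, a: 1.0 if a == o * 2 else 0.0,
    propose=lambda o: o * 2 if random.random() < 0.5 else o,
    memory=memory,
    steps=30,
)
# Once every observation has a cached win, the loop never fails again.
```

Note what is and is not learning here: the proposer never improves, only the artefact does — which is exactly the claim both camps make about the frozen LLM.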
Karpathy’s LLM-wiki and AutoResearch: the consumer version
Andrej Karpathy released two artefacts in the same season, both pulling in the same direction.
LLM-wiki (Karpathy 2026b), published 4 April 2026, describes a pattern in which an LLM agent reads raw documents and compiles them into a persistent, human-readable Obsidian-compatible markdown wiki — index file, entity pages, [[wiki-links]], provenance, contradictions flagged. Karpathy’s framing is explicit: “Stop re-deriving, start compiling.” The wiki, not RAG retrieval at query time, is the persistent, compounding artefact. The gist itself reports that his research wiki on a single ML topic grew to “~100 articles and ~400,000 words,” which is to say, somewhat longer than this article.
AutoResearch (Karpathy 2026a), released 7 March 2026, is the same idea pointed at ML research. A stripped-down, ~630-line, single-GPU nanochat training script (train.py), an immutable evaluator (prepare.py), and a human-authored program.md. A coding agent edits train.py, trains for five minutes, keeps changes that lower val_bpb, discards the rest. Karpathy’s own two-day run produced ~700 experiments and stacked 20 additive improvements that dropped “Time to GPT-2” from 2.02 hours to 1.80 hours.
If you squint, this is just Weng’s Heuristic Learning with a different objective metric. Karpathy himself has framed the progression as vibe coding → agentic engineering → autonomous research. As Fortune reported on 17 March 2026, he announced on X that “All LLM frontier labs will do this. It’s the final boss battle,” adding that doing it is “just engineering” and “it’s going to work.” Frontier labs have, on the available evidence, called several things “the final boss battle” in the past eighteen months, and the boss appears to keep respawning. Read this one as a billboard, not a forecast. The ratchet only accepts immediate improvements, so it cannot do “worse before better” — a real limitation, flagged in the repo’s own GitHub issues with a touching lack of embarrassment.
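For concreteness, the keep-if-better ratchet reduces to a few lines. The sketch below substitutes a toy numeric objective for val_bpb and random parameter tweaks for LLM-authored patches; it also makes the worse-before-better limitation visible, since the loop discards any edit that does not immediately improve the score.

```python
import random

def ratchet(score, params, propose_edit, n_experiments, rng):
    """Greedy ratchet: propose an edit, keep it only if the score drops."""
    best = score(params)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose_edit(params, rng)
        s = score(candidate)
        if s < best:          # lower is better, like val_bpb
            params, best, kept = candidate, s, kept + 1
        # Otherwise the edit is discarded: no "worse before better" allowed.
    return params, best, kept

# Toy objective standing in for val_bpb: minimise (x - 3)^2.
rng = random.Random(0)
score = lambda p: (p["x"] - 3.0) ** 2
propose = lambda p, r: {"x": p["x"] + r.uniform(-0.5, 0.5)}

final, best, kept = ratchet(score, {"x": 0.0}, propose, 200, rng)
# `best` only ever decreases, so the run ends at or below its start of 9.0.
```

The monotone-acceptance rule is what makes the loop safe to run unattended overnight, and it is also exactly why it cannot cross a valley in the objective.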
The upstream tradition
It would be lazy to credit Weng and Wang with inventing all this. The lineage is clear: Voyager (Wang et al. 2023) introduced the ever-growing executable skill library in Minecraft; Reflexion (Shinn et al. 2023) introduced verbal reinforcement learning with text-based episodic memory; Self-Refine (Madaan et al. 2023) demonstrated iterative refinement with single-model feedback; Eureka (Ma et al. 2023) had GPT-4 evolve reward functions for IsaacGym; FunSearch (Romera-Paredes et al. 2023) combined a frozen LLM mutation operator with an evaluator to evolve Python heuristics, finding new bin-packing rules and a cap-set construction at n=8. What Weng and Wang add is not the loop. It is the explicit claim that the loop, not the model, is the unit of learning, and that this is a paradigm rather than a clever hack on top of the current one.
Scoring HL Against Domingos’ Three Criteria
Criterion 1: Unification of the five tribes
This is the criterion HL satisfies most spectacularly, and the reason it deserves the master-algorithm conversation at all. The runtime architecture pulls in:
- Connectionist core. The frozen LLM is a deep neural network. Function approximation, distributed representation, gradient-trained perception — all there, in the substrate.
- Symbolist outer layer. The learned object is symbolic: Python code, markdown skill files, decision rules, [[wiki-links]]. This is the artefact that grows, not the LLM. The thing being updated is exactly what a Symbolist would want updated.
- Evolutionary loop. Karpathy’s AutoResearch is literally a ratchet (mutate → evaluate → keep if better). FunSearch is an island-model evolutionary algorithm. Memento-Skills’ failure-attribution-then-rewrite is a directed mutation operator. Weng’s coding-agent edits are guided mutations.
- Analogizer retrieval. Memento’s case-based reasoning is literally CBR; the M-MDP formalism and the trained retrieval policy μ are pure Analogizer machinery. Memento 2’s Read step is k-nearest-neighbour over experience.
- Bayesian flavour, weakly. The retrieval policy μ is in effect a posterior over which experiences are relevant; the LLM’s next-token distribution is a learned conditional. The Bayesians get the least obvious representation in HL, but they are not absent. (One might charitably say they are operating under deep cover.)
Domingos’ own preferred candidate, Markov logic networks, attempted to unify Symbolists and Bayesians, with hooks to the other tribes. His more recent tensor logic tries to unify Symbolists and Connectionists in a single differentiable formalism. HL takes a strikingly different path: rather than seeking one mathematical object that contains all five tribes, it gives each tribe a layer in a runtime system. By the unification criterion alone, HL is the most complete unifier yet proposed.
Verdict on Criterion 1: passes, possibly definitively.
Criterion 2: Learning from data without per-domain engineering
This is where the picture becomes more nuanced.
In Weng’s blog, the coding agent receives only the environment interface and a reward signal, and writes the entire policy. There is no domain-specific feature engineering, no per-game heuristic library hand-coded by humans. In Memento-Skills, the agent starts from 5 atomic skills and discovers 230 more on its own. In AutoResearch, Karpathy hands the agent train.py and a metric; the agent does the rest. In LLM-wiki, the agent reads raw documents and produces structured knowledge with no human-written extraction rules.
So at the level of the HL system, the criterion looks well satisfied: across radically different domains (Atari, MuJoCo, GAIA, HLE, ML research, document corpora), the same architecture works with the same affordances.
There is, however, a subtler form of “per-domain engineering” hiding in plain sight. The LLM itself was pretrained on enormous quantities of human-engineered text. The agent does not start from a blank slate; it starts from gpt-5.4 or Claude or Qwen. Is that a violation of Criterion 2?
I do not think it is, and here is why. Domingos’ criterion was about the learner, not the substrate. Humans bootstrap on language, culture, and several million years of evolved priors; this does not disqualify human learning from being general. AlphaZero bootstraps on the rules of chess. Every learner sits on some substrate. What Criterion 2 forbids is per-task hand-engineering — and HL avoids that. The pretrained LLM is the substrate, like the visual cortex; the HL loop is the learner.
The harder question — whether HL can ever exceed what is latent in its substrate — I shall defer to Criterion 3. For Criterion 2 as Domingos stated it, HL passes.
Verdict on Criterion 2: passes, with the caveat that “no per-domain engineering” is read as “no per-task engineering on top of a general substrate.”
Criterion 3: Knowledge of arbitrary complexity
This is the criterion I got wrong in an earlier draft, and the correction matters.
The naïve worry is that HL is “parasitic on the LLM” and therefore cannot represent or acquire anything the LLM does not already contain. This conflates bootstrapping with ceiling.
The learned artefact in HL is Turing-complete code (Weng, AutoResearch, Memento-Skills’ SKILL.md files include executable Python) or Turing-completeness-equivalent symbolic structures (LLM-wiki with backlinks and rules; Memento’s growing case base). There is no upper bound on the size or complexity of these artefacts. Memento-Skills has already demonstrated growth to 235 skills; nothing in the architecture stops it from growing to 235,000, except patience and electricity. The artefact’s representational ceiling is the Turing-computable functions — exactly the ceiling Domingos imagines for his master algorithm.
So in principle, Criterion 3 is satisfied. HL passes the formal test.
But three substantive worries sit just below the formal test, and these are the real reasons to hesitate before declaring victory:
Discovery versus recomposition. Can HL discover new knowledge — knowledge not already latent in the LLM — or is it limited to recomposing existing LLM priors? The honest empirical answer is partially. FunSearch found mathematical objects (cap-set at n=8, new bin-packing heuristics) that were genuinely novel and not present verbatim in the training data, via the LLM-mutation-plus-evaluator pattern. That is genuine discovery, and it counts. But Memento-Skills’ 235 skills are mostly recompositions of LLM priors into task-specific scaffolds, and Karpathy’s AutoResearch found incremental engineering wins rather than scientific surprises. The strong form of Criterion 3 — that the learner can in principle acquire any Turing-computable function from sufficient data — is satisfied. The strong form in practice, the one in which you would trust HL to discover the next AlphaGo move 37, remains unproven. Reasonable people can disagree about whether this constitutes a Criterion 3 failure or a separate efficiency concern.
Compute economics. Turing-completeness in principle is not tractable scaling in practice. Every edit costs a frontier-model inference. The compute curves in Weng’s paper are environment-step curves, not total-FLOP curves. If HL can in principle represent arbitrary complexity but in practice cannot pay for the search, the master-algorithm claim is weaker than it sounds. Domingos did not list efficiency as a criterion, but he plainly meant the master algorithm to be useful — not merely existent in some Platonic sense, like the answer to chess.
One algorithm or one system? The deepest reading question. In The Master Algorithm, Domingos frequently writes as if he is looking for a single algorithm in the strict sense — one learning rule, one mathematical object, one training loop. HL is not that. It is a system with layered components, each from a different tribe. If the master-algorithm question is “what is the single learning rule?”, HL is disqualified because it has at least four (gradient descent in the LLM substrate, KL-regularised Bellman backup in the retrieval policy, evolutionary ratcheting at the artefact level, code-editing at the policy level). If the question is “what is the architecture that learns generally?”, HL is the strongest answer. I lean toward the latter reading, but a Domingos purist could reasonably push back.
Verdict on Criterion 3: passes the formal test (Turing-complete artefacts grow without bound). The deeper questions — discovery versus recomposition, compute economics, one-algorithm versus one-system — are open, and they matter for whether HL is the master algorithm or merely a candidate.
The Honest Verdict
By Domingos’ own criteria, Heuristic Learning has a stronger claim than:
- Markov logic networks (passes Criterion 1 partially — Symbolists + Bayesians — but never delivered Criterion 3 in practice).
- Deep neural networks (pass Criterion 3 spectacularly, fail Criterion 1: pure Connectionism is not unification, however much its enthusiasts wish it were).
- Genetic programming (passes Criterion 1 partially and Criterion 3 in principle, fails Criterion 2 on most domains without enormous compute).
- Bayesian networks (pass Criterion 2 elegantly on structured data, fail Criterion 3 in practice owing to inference intractability).
HL is the first candidate to pass all three plausibly. That is a serious claim and I am not making it lightly. The reason it can pass all three is precisely that it gives up on finding one algorithm and accepts that a master architecture, with components from each tribe, is what unification actually looks like.
Whether you call that the master algorithm depends on your reading of Domingos. The strict reading — one mathematical object, one learning rule — says no. The functional reading — one general-purpose system that learns from data with no per-task engineering and represents knowledge of arbitrary complexity — says yes.
My honest position: HL is the master algorithm if you accept the system-architecture reading, and the closest thing we have ever had to it under any reading. The objections I find genuinely troubling are not the formal-criteria ones but the two empirical ones. First, we have not yet seen HL discover knowledge that meaningfully exceeds its LLM substrate outside narrow domains like FunSearch. Second, the compute economics are unresolved. Those are the questions the next eighteen months will answer. For the first time in a decade, somebody might actually have to put up or shut up — Domingos very much included.
Where HL Sits Among the Classical Rule Learners
For the article to be even-handed, it is worth being concrete about what HL gains and loses relative to classical Symbolist methods.
Decision trees (CART, C4.5, ID3), random forests, XGBoost, LightGBM. Extraordinarily sample-efficient, interpretable, well-understood theoretically. They dominate Kaggle on tabular data and will continue to do so long after the current enthusiasm has subsided. They cannot integrate world knowledge or operate on raw perceptual or textual inputs. HL inherits knowledge and language but loses almost all the formal guarantees.
Genetic programming and Inductive Logic Programming. GP’s whole game is evolving symbolic programs — exactly what FunSearch does at scale. ILP (see Cropper and Dumančić 2022) learns first-order logic rules with strong inductive bias and tiny data requirements, but scales catastrophically badly to noisy real-world inputs. Differentiable ILP (NLIL, GLIDR, LNNs) tries to fix this. HL is a complementary path: instead of differentiable rules, use natural-language-mediated rule editing. Less rigorous, considerably more flexible, and — for now — considerably more expensive.
Rule extraction from neural networks. A long literature (TREPAN, DeepRED, and others) attempts to distil rules from trained networks. HL inverts the problem: it generates rules natively in code form, with the LLM as a stand-in for the symbolic search engine.
The honest comparison:
| Axis | Classical rule learning | LLM-empowered HL |
|---|---|---|
| Interpretability | High | High (code is readable) |
| Sample efficiency | Excellent on tabular | Excellent given priors |
| Generalisation | Often poor OOD | Inherits LLM priors |
| Compositionality | Strong for ILP/GP | Strong via skill libraries |
| Formal guarantees | Sometimes (PAC-style) | Mostly absent |
| Compute cost | Tiny | Frontier LLM at every step |
Recommendations
What follows is for practitioners. I write it from the perspective of someone building a company in this space, which is to say with all the predictable biases. Caveat lector.
Stage 1 — now. If your problem has a clear scoring function and an LLM can plausibly write code that addresses it, run a Weng-style or AutoResearch-style ratchet loop overnight. Constraint: keep the artefact small enough that the agent can re-read it in one context window. Threshold to escalate: a 100-experiment overnight run that lifts your metric by more than 5% over your current baseline.
Stage 2 — next quarter. Build a Memento-style case bank for any agent you deploy in production. The 4.7–9.6 absolute-point lift on OOD tasks reported in the original Memento paper is the kind of result you can usually reproduce in-house if your task has any task-to-task structure at all. Threshold to abandon: if cases never get retrieved more than once, your task is too narrow and the overhead is not paying for itself.
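A minimal case bank, assuming nothing beyond the general Memento pattern — store episodes, retrieve the nearest successful one, count retrievals so the abandonment test above is checkable — might look like this. The similarity measure is naive token overlap, not a learned retrieval policy, and every name below is mine.

```python
class CaseBank:
    """Toy episodic memory: write solved cases, read back the nearest win."""

    def __init__(self):
        self.cases = []

    def write(self, task, solution, reward):
        self.cases.append({"task": task, "solution": solution,
                           "reward": reward, "retrievals": 0})

    def read(self, task):
        """Return the most similar successful case, or None if there is none."""
        def sim(case):
            a, b = set(task.split()), set(case["task"].split())
            return len(a & b) / max(1, len(a | b))
        winners = [c for c in self.cases if c["reward"] > 0]
        if not winners:
            return None
        best = max(winners, key=sim)
        best["retrievals"] += 1          # audit trail for the abandonment test
        return best

bank = CaseBank()
bank.write("extract table from pdf report", "use pdfplumber", 1.0)
bank.write("summarise news article", "chunk then summarise", 1.0)
hit = bank.read("extract table from quarterly pdf")
# hit["solution"] == "use pdfplumber"; its retrieval count is now 1.
```

Instrumenting retrieval counts from day one is the cheap part; it is also the only way to apply the abandonment threshold without guessing.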
Stage 3 — when frontier coding models cross the next threshold. Move from per-task program-evolution to a Memento-Skills-style cross-task skill library. The benchmark to watch is HLE. When an open-source agent with a frozen base model crosses 60% via skill accumulation alone, it is time to rebuild your stack around skill memory as a first-class primitive, rather than as an afterthought one bolts on at the end of the sprint.
Stage 4 — research bet. The unsolved problem is the HL ↔ NN bridge: how do you periodically distil what the skill library has learned back into the model weights without destabilising the LLM? This is the post-training problem Weng explicitly defers, and it is also the most direct route to resolving the “discovery versus recomposition” worry under Criterion 3. Anyone who solves this gets the actual prize, and they will not have to share it with the Bayesians.
Decision rules. Use HL when interpretability and per-edit sample efficiency matter, and when your action space is naturally programmatic. Use Deep RL when perception dominates, when you have cheap simulation, or when the policy is fundamentally continuous and non-symbolic. Use both when you can — this is the System-1 / System-2 hybrid most of us will eventually ship, whether we plan to or not.
Caveats
- Mixed source types. Several of the central sources are blog posts or GitHub artefacts, not peer-reviewed papers. Treat these as primary evidence of where leading practitioners’ thinking is heading, not as settled science. The peer reviewers will, in their own good time, catch up.
- Theory versus empirics. Memento 2’s convergence theorems hold under bounded-reward and γ<1 assumptions and require the memory coverage radius r_M to shrink. The constants ε_LLM and δ_M are not bounded analytically. The theory is a useful organising framework, not a guarantee, and it should not be confused for one.
- Benchmark inflation. GAIA and HLE are saturating quickly. Memento’s 87.88% on GAIA validation is impressive, but baseline comparisons in some ablations use Qwen2.5-7B while Memento itself uses GPT-4.1 plus o4-mini, which is rather like racing a Ferrari against a Reliant Robin and being surprised at the outcome.
- The Domingos framing is mine, not Domingos’. Domingos has not commented on HL specifically. His public position is sceptical of LLMs: interpretability matters more than scale, LLMs lack true understanding, regulate applications and not algorithms. He would, I strongly suspect, not call HL the master algorithm. This article is an argument that he ought to.
- I have, as the disclosure forms put it, an interest. Bloo-Mind builds agentic systems and we use techniques in this family. Treat my enthusiasm with appropriate suspicion. The honest test of any of these ideas is whether they survive industrial deployment over the next eighteen months. We do not yet have that data, and anyone who tells you otherwise is, in the precise technical sense, selling something.