The Benchmark Theatre: Why Almost Nothing You’ve Read About Agent Memory Scores Is True

The agent long-term memory leaderboard is theatre. I work through the MemPalace controversy, the Mem0–Zep–Letta dispute, and the six structural reasons almost no published memory benchmark score should be taken at face value — broken ground truth, lenient LLM judges, benchmarks short enough to fit in a single context window, and the small matter that every system is evaluated by its own authors. If you are picking a memory system today, the most useful thing you can do is ignore the published number.
Author: Dell Zhang

Published: 2026-05-20

TL;DR

  • Almost every headline number on the agent long-term memory (LTM) leaderboard collapses on contact with a careful reader, and the failure modes are not exotic. They are the same handful of evaluation sins, repeated by nearly every system, including the polite ones. The Mem0 vs Zep vs Letta dispute, the LoCoMo audit, the MemPalace fiasco, and the new MemoryBench / AMemGym dynamic-evaluation papers are all pointing at the same six problems, from rather different angles.
  • The deepest issue is structural, not moral. The current benchmarks (LoCoMo, LongMemEval) are short enough to fit comfortably inside a frontier context window, contain double-digit percentages of broken ground truth, are scored by remarkably lenient LLM judges that accept the majority of vague wrong answers, and are run by each vendor on its own pipeline using its own prompts. Under those conditions, “state of the art” is whatever the marketing team has decided it is on any given Tuesday.
  • If you are picking an LTM system today, ignore the published score. Run your own evaluation with full-context and filesystem-search baselines on your own data, audit the ground truth before you trust it, and demand multi-seed numbers with the judge prompt published in full. If a vendor will not share the judge prompt and the eval harness, the score does not, in any useful sense, exist.

Key Findings

  1. The flagship benchmarks are corrupted. Penfield Labs’ systematic audit of LoCoMo (Penfield Labs, 2026), the most-cited conversational-memory benchmark, found 99 score-corrupting errors in 1,540 questions: a 6.4 % rate, nearly double the “at least 3.3% errors across the 10 datasets” that Northcutt, Athalye, and Mueller reported in Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (2021). The theoretical ceiling on the published LoCoMo is therefore about 93.6 %. Scores reported above that ceiling, such as EverMemOS’s 95.96 % single-hop (its 91.37 % multi-hop sits just below it), are mathematically impossible without taking credit for getting wrong answers wrong in the right direction.

  2. The judges accept what they should reject. The standard LoCoMo judge (GPT-4o-mini with the original prompt) accepted 62.81 % of intentionally wrong-but-topical answers in Penfield’s stress test. Vague answers that named the right topic but missed every specific detail sailed through nearly two times in three. That is, conveniently, precisely the failure mode of weak retrieval, and the benchmark is now rewarding it.

  3. The competitive numbers are a public fight, not a measurement. Mem0’s April 2025 paper claimed SOTA over Zep on LoCoMo. Zep’s blog post — titled, with admirable directness, “Lies, Damn Lies, & Statistics” — claimed Zep actually beat Mem0 by 24 %, then quietly corrected to 10 % (75.14 % ± 0.17 against Mem0’s best of ~68 %) after Mem0’s CTO documented an arithmetic error: Zep had included Category 5 answers in the numerator while excluding them from the denominator, inflating the headline by roughly 25 percentage points. Letta then strolled in and reported 74.0 % using a plain filesystem and grep, no memory system at all, undercutting both. It is, on balance, not the field’s finest hour.

  4. The hottest project of 2026 was a marketing artefact. MemPalace, credited to actress Milla Jovovich and engineer Ben Sigman, reached 1.5 million tweet impressions and over 5,400 GitHub stars in twenty-four hours on a claim of “100 % on LoCoMo” and “the first perfect score ever recorded on LongMemEval.” Within forty-eight hours, the project’s own BENCHMARKS.md was found to confess, on line 461, that the 100 % had been produced by writing three patches for three specific dev-set questions — “teaching to the test”, in the file’s own words — and that the LoCoMo 100 % had been achieved with top-k = 50 against a candidate pool that maxes out at 32 sessions, which reduces the retrieval step to the rather less glamorous task of pouring the entire conversation into Claude Sonnet.

  5. The benchmarks measure context-window management, not memory. LongMemEval-S, the second-most-cited memory benchmark, uses “approximately 115k tokens per problem” (Wu et al., 2025). Current frontier models support 200K to 1M tokens. The entire history for each problem, in other words, fits inside a single prompt. On LongMemEval-S, GPT-4o with full-context scored about 60.2 %; Mastra’s “observational memory” scored 84.23 %. The benchmark is therefore measuring how cleverly a system organises text inside an already-sufficient context window. This is not, by any reasonable definition, long-term memory.

  6. The new wave of “dynamic” benchmarks shows the old leaders losing. Three 2025-2026 papers — MemoryBench (Tsinghua), AMemGym (HKUST/Meituan, ICLR 2026), and CASCADE — explicitly construct interactive, on-policy, multi-domain test environments. MemoryBench’s authors report being “surprised to find that the time of existing memory-based methods used to memorize input context and conducting online inference are extremely long” and that “none of the advanced memory-based LLMsys (i.e., A-Mem, Mem0, or MemoryOS) can consistently outperform RAG baselines that simply use all task context and feedback logs as retrieval corpus.” This contradicts, with considerable directness, the published numbers of all three systems.
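The ceiling arithmetic in finding 1 is simple enough to check directly. The sketch below uses only the figures quoted there; the helper name `exceeds_ceiling` is an assumption for illustration, and comparing a per-category score against an aggregate ceiling is a simplification, since the audit's errors need not be spread evenly across categories.

```python
# Figures quoted in finding 1 (Penfield Labs audit of LoCoMo).
ERRORS = 99
QUESTIONS = 1540

error_rate = 100.0 * ERRORS / QUESTIONS   # percent of questions with a bad answer key
ceiling = 100.0 - error_rate              # best honest score against that key


def exceeds_ceiling(score: float, limit: float) -> bool:
    """True when a reported score is higher than the noise-limited ceiling."""
    return score > limit


# Per-category scores quoted in finding 1 for EverMemOS.
reported = {"single-hop": 95.96, "multi-hop": 91.37}

print(f"error rate: {error_rate:.1f} %, ceiling: {ceiling:.1f} %")
for category, score in reported.items():
    print(category, score, "above ceiling" if exceeds_ceiling(score, ceiling) else "below ceiling")
```

A score above the ceiling is only reachable by agreeing with some of the broken answer keys, which is the sense in which the finding calls it impossible.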

Details

Hook: The Milla Jovovich Story, and Why It Is Not a Joke

On 5 April 2026, an account named MemPalace/mempalace pushed a Python repository to GitHub (MemPalace contributors, 2026). The next day, a developer named Ben Sigman — by day the CEO of a Bitcoin-lending marketplace called Libre — tweeted the launch. The tweet credited the actress Milla Jovovich, of Resident Evil and The Fifth Element, as co-author. The repository’s headline claim was unambiguous and somewhat unhinged: “100 % on LoCoMo” and “the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100 %.”

By the next morning the tweet had crossed 1.5 million impressions, the repo had over 5,400 stars, and the news cycle had begun. Forbes ran an uncritical puff piece. Cybernews wrote it up under the headline “Milla Jovovich creates MemPalace AI memory tool.” A “MemPalace” memecoin appeared briefly on pump.fun, with a 50/50 creator-reward split, and was pumped and dumped within twenty-four hours — pour one out, perhaps, for the late retail bagholders. The repo crossed 26,900 stars and 3,300 forks within a few days.

The choice of name was, on its face, lovely. The “memory palace” technique — formally, the method of loci — is the oldest mnemonic device in the Western tradition. It is attributed to the Greek poet Simonides of Ceos, who, having stepped outside a banquet hall moments before the roof came in, was able to identify each crushed corpse by remembering where its owner had been sitting. Cicero adapted the trick for Roman oratory; Matteo Ricci taught it to sixteenth-century Chinese scholars; Benedict Cumberbatch’s Sherlock Holmes turned it into a Tumblr meme in the 2010 BBC series, where the line “I need to go to my mind palace” became one of those phrases the show is now contractually obliged to deliver at least once an episode. Applying that scaffold to an LLM’s context store — wings for projects, halls for memory types, rooms for topics, drawers for verbatim text — is a perfectly sensible piece of design.

The problem was not the metaphor. The problem was that the marketing version of MemPalace and the codebase version of MemPalace turned out to be two quite different artefacts. Within hours, Penfield Labs published a six-thousand-word teardown. It catalogued, in order:

  • The LoCoMo bypass. The runner set top_k = 50 against conversations that contain at most 32 sessions, which means the entire candidate pool is always returned and the retrieval step reduces to reading comprehension by Claude Sonnet over every session. The project’s own BENCHMARKS.md says this verbatim: “the embedding retrieval step is bypassed entirely.” It would be hard to put it more plainly without simply admitting it on a billboard.
  • The LongMemEval metric category error. LongMemEval is an end-to-end QA benchmark scored by a GPT-4 judge. The MemPalace runner does retrieval only — it never generates an answer and never invokes a judge. It returns the top-five sessions and gets credit if any one of the gold session IDs is in that set. That is recall@5 on a labelled dataset, a meaningfully easier task than the one the LongMemEval leaderboard scores. Calling 96.6 % R@5 “the highest published LongMemEval result” is not so much a mistake as a category change announced in passing.
  • The hand-tuned 100 %. The hybrid v4 mode that produced the perfect score was built by inspecting the three remaining wrong answers in the dev set and writing targeted code for each: a quoted-phrase boost for one question, a person-name boost (“Rachel”) for another, and “I still remember” / “when I was in high school” patterns for a third. Three patches for three specific questions. Then the same five hundred questions were rerun and the result published as perfect. The project’s own benchmarks file, on line 461, calls this “teaching to the test.” Discovering this confession in the repository is rather like finding, on page 461 of the prospectus, a note explaining that the company has no products.
  • Features absent from the code. The launch tweet declared that “contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them.” The relevant file, mempalace/knowledge_graph.py, contains precisely zero occurrences of the substring “contradict”; its only deduplication is an exact-match check on (subject, predicate, object) triples. Conflicting facts about the same entity may therefore accumulate indefinitely, which is — one assumes — not the advertised behaviour.
  • AAAK is not lossless. The tweet claimed “30x lossless compression.” The dialect module truncates sentences at fifty-five characters and provides no round-trip decoder. The same BENCHMARKS.md measures AAAK at 84.2 % R@5 against raw mode’s 96.6 %, a 12.4-point regression. Lossless compression cannot, definitionally, cause measured quality loss.
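The first two items in the list come down to a few lines of logic. The sketch below is illustrative only: the function names and the toy session ids are assumptions made for this example, not code from the MemPalace repository. It shows why a top_k at or above the pool size makes the ranking irrelevant, and how forgiving an any-hit recall@k metric is.

```python
from typing import Sequence


def top_k(ranked_ids: Sequence[str], k: int) -> list[str]:
    """Return the first k ids of an already-ranked candidate list."""
    return list(ranked_ids[:k])


def retrieval_is_bypassed(pool_size: int, k: int) -> bool:
    """True when k covers the whole pool, so ranking cannot change the result."""
    return k >= pool_size


def recall_any_at_k(ranked_ids: Sequence[str], gold_ids: set[str], k: int) -> float:
    """1.0 if any gold session id appears in the top k, else 0.0."""
    return 1.0 if gold_ids & set(top_k(ranked_ids, k)) else 0.0


# Hypothetical conversation with 32 sessions, ranked in the worst possible
# order: the one gold session is placed last.
sessions = [f"s{i}" for i in range(32)]
gold = {"s31"}

print(retrieval_is_bypassed(len(sessions), k=50))  # True
print(recall_any_at_k(sessions, gold, k=50))       # 1.0
print(recall_any_at_k(sessions, gold, k=5))        # 0.0
```

Even a ranker that puts the right session dead last scores perfectly once k exceeds the pool, and neither number says anything about whether a correct answer was ever generated.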

On 7 April, Sigman and the maintainers posted a public “what we got wrong” note in the README, walking back every inflated claim and thanking the issue filers. The retraction is, to its considerable credit, candid: “Brutal honest criticism is exactly what makes open source work, and it’s what we asked for.” By the second week of April, the headline 100 % had quietly become 96.6 %, with the appropriate caveat that this is R@5 on session retrieval and not the leaderboard metric. The honest version of the project — verbatim storage plus default ChromaDB embeddings beats LLM-extraction approaches at session retrieval — is genuinely interesting and would have been worth a blog post. It did not require a perfect score.

The MemPalace story matters not because MemPalace was uniquely bad. It matters because every individual failure mode in MemPalace is a failure mode the rest of the AI-memory field has been quietly committing for two years, only without an actress’s name to make the press care. Penfield’s authors, who maintain their own memory project and have an obvious axe to grind, said this in the article’s most important sentence: “The methodology errors are common across the field. The honesty gap between the repository and the marketing is arguably the bigger story. The celebrity name is the reason anyone heard about it.” Which is to say: the only unusual thing here was that anyone bothered to look.

The Mem0 vs Zep vs Letta Affair

Roughly a year before MemPalace, the same dynamic played out among the serious projects, in slower motion and without the benefit of a memecoin.

In April 2025, Mem0 published Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory on arXiv (2025). The paper reported, on the LoCoMo benchmark, a 26 % relative improvement over OpenAI’s built-in memory, p95 latency reduced by 91 %, and tokens reduced by 90 %. Mem0’s best configuration scored about 68 %. The paper benchmarked against six categories of baseline, including Zep, which Mem0 helpfully reported at 65.99 %.

On 6 May 2025, Zep published a blog post titled Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory? (2025) The post argued that when Zep was implemented correctly, its score on LoCoMo was 84 %, not the 65.99 % Mem0 had reported, putting Zep 24 % ahead of Mem0 rather than 3 % behind. Zep’s specific complaints: Mem0’s harness used a user-graph designed for single-user conversations but assigned both speakers to the user role; Mem0 appended timestamps to message bodies instead of using Zep’s created_at field; and Mem0 ran searches sequentially instead of in parallel.

Three weeks later, on Zep’s own GitHub, Mem0’s CTO Deshraj Yadav opened an issue with the rather dry title Revisiting Zep’s 84 % LoCoMo Claim: Corrected Evaluation & 58.44 % Accuracy (2025). The accusation was specific and damning: Zep’s calculation had included Category 5 (the adversarial category everyone agrees to exclude) in the numerator but excluded it from the denominator, mechanically inflating the score by approximately 25 percentage points. Mem0 reran Zep’s pipeline with the correction and ten independent seeds; the result was 58.44 % ± 0.20, not 84 %. Mem0 also noted, with a certain composure, that Zep’s original article had reported only a single run and had modified the system prompt in a way that gave Zep an advantage no other baseline received.

Zep then conceded. Daniel Chalef, Zep’s CEO, edited the original blog post: “we erred in how we calculated Zep’s LoCoMo score. We’ve updated the article to reflect Zep’s corrected result is 75.14 % +/- 0.17, with Zep outperforming Mem0 by 10 %.” The 10 % lead was still, technically, a lead. But the absolute number had moved from 84 % to 75.14 %, a drop of nearly nine points, which dwarfs the ± 0.17 run-to-run spread attached to the corrected figure. The post about lies, damn lies and statistics had, on consideration, been a touch lighter on statistics than the title implied.
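The Category 5 error is easy to reproduce with toy numbers. In the sketch below, the question totals are the LoCoMo figures (1,986 pairs, 1,540 of them non-adversarial), but the two "correct" counts are hypothetical values picked so that the output lands near the disputed figures; they are not taken from either vendor's logs.

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


TOTAL_QUESTIONS = 1986
NON_ADVERSARIAL = 1540
ADVERSARIAL = TOTAL_QUESTIONS - NON_ADVERSARIAL  # Category 5

# Hypothetical counts of correct answers, for illustration only.
correct_non_adv = 900
correct_adv = 394

# Consistent: Category 5 excluded from both numerator and denominator.
consistent = accuracy(correct_non_adv, NON_ADVERSARIAL)

# Inconsistent: Category 5 counted in the numerator but not the denominator.
inflated = accuracy(correct_non_adv + correct_adv, NON_ADVERSARIAL)

print(f"consistent: {consistent:.2f} %")
print(f"inflated:   {inflated:.2f} %")
print(f"difference: {inflated - consistent:.2f} points")
```

The system's answers are identical in both cases; only the bookkeeping differs, and that alone moves the headline by about twenty-five points.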

While that was happening, Letta — the company that came out of the original MemGPT paper — published Benchmarking AI Agent Memory: Is a Filesystem All You Need? (2025) The headline finding was that a Letta agent running GPT-4o-mini, with no specialised memory tools at all, scored 74.0 % on LoCoMo simply by storing the conversation history in a file and giving the agent grep, search_files, open, and close. This beat Mem0’s reported 68.5 % for its best graph configuration. The conclusion the post drew, with admirable understatement, was that “current memory benchmarks may not be very meaningful” and “memory is more about how agents manage context than the exact retrieval mechanism used.” This is the polite way of saying that everyone in the room had built a graph database to do the work of a Unix utility shipped in 1973.
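For readers who want a filesystem baseline of their own, the idea fits in a few lines. This is a minimal sketch and not Letta's implementation: `grep_sessions`, the one-file-per-session layout, and the sample dialogue are all assumptions made for illustration. A real baseline would hand a search tool like this to an agent and let it iterate.

```python
import re
import tempfile
from pathlib import Path


def grep_sessions(root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for each line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(root.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((path.name, lineno, line))
    return hits


# Made-up two-session history, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "session_01.txt").write_text(
        "Ana: I started a pottery class.\nBen: Nice, how is it going?\n",
        encoding="utf-8",
    )
    (root / "session_02.txt").write_text(
        "Ben: Any weekend plans?\nAna: I adopted a dog named Biscuit.\n",
        encoding="utf-8",
    )
    hits = grep_sessions(root, r"adopted")

print(hits)
```

Any memory system worth paying for should beat this on your own data by a margin larger than the run-to-run noise.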

So, to summarise: on the same benchmark, against the same dataset, in the same season, the three best-funded memory companies in the world reported, respectively, that their system was state of the art (Mem0, ~68 %), that the competitor’s measurement of their system was wrong by a wide margin (Zep, 84 %, later 75 %), and that you could match any of them with a standard Unix utility (Letta, 74 % via filesystem and grep). The reasonable observer’s response is not to pick a winner. The reasonable response is to stop trusting LoCoMo, and possibly to take a long walk.

What Is Actually Wrong With LoCoMo

LoCoMo (Maharana et al., 2024) was the first conversational-memory benchmark to push past the toy regime: ten synthesised dialogues, each with 19 to 32 sessions and averaging 16,000-26,000 tokens, with 1,986 question-answer pairs (1,540 non-adversarial). It was, when it appeared, a genuine contribution. It is now structurally unfit for what it is being used for, in six identifiable ways.

1. The ground truth is broken. Penfield’s audit found 99 score-corrupting errors in 1,540 questions, a 6.4 % rate. The flavour of the errors is instructive. One question asks about a car a speaker mentioned; the answer key says “Ferrari 488 GTB”; the source conversation contains only the phrase “this beauty” and an image caption reading “a red sports car”; the car model exists only in the annotator’s internal search-string field, which no memory system ingests. As Penfield puts it: “Systems are evaluated against facts they have no access to.” Twenty-four other questions attribute statements to the wrong speaker, so any system with accurate speaker tracking will systematically disagree with the answer key. One temporal question requires that “Last Saturday” on a Thursday resolve to the preceding Saturday; the answer key says Sunday; “a system that performs the date arithmetic correctly is penalised.” The theoretical ceiling, given this noise floor, is about 93.6 %. Systems reporting above that ceiling — and EverMemOS reports 95.96 % single-hop and 91.37 % multi-hop, with an overall score 1.25 points from the aggregate ceiling — are, mathematically, claiming credit for getting the wrong answers wrong in the right direction.

2. The judge is lenient. LoCoMo is scored by an LLM judge — usually GPT-4o-mini — with the original prompt explicitly instructing it to be “generous.” Penfield’s adversarial test fed the judge intentionally wrong but topically adjacent answers for all 1,540 questions. The judge accepted 62.81 % of them. Wrong-name and wrong-date errors were caught about 89 % of the time, but vague answers that named the right topic and missed every specific detail passed nearly two-thirds of the time. That is the failure mode of weak retrieval — finding the right conversation but extracting nothing specific — and the benchmark is rewarding it. When the noise floor of the judge itself is sixty percentage points wide, score differences smaller than that are not so much uninterpretable as actively misleading.

3. The conversations are short enough that “no memory” is competitive. Zep’s blog post pointed out that Mem0’s own results showed a full-context baseline (just feed the entire conversation to the LLM) scoring about 73 %, which beat Mem0’s best result of 68 %. If feeding all the text yields a better answer than the specialised memory system, the benchmark is not stressing memory. It is stressing context-window management. This is the AI equivalent of running the London Marathon and discovering the winner took the bus.

4. There is no standardised pipeline. Each system uses its own ingestion method (defensible, given architectural differences), its own answer-generation prompt (less defensible), and sometimes a different judge model (frankly indefensible). Scores from different vendors are then plotted in the same table as if they shared methodology. They do not.

5. Reproducibility is documented to fail. Third-party reproduction attempts on EverMemOS yielded 38.38 % against the official 92.32 % (issue #73). Mem0 has multiple open reproducibility issues (#3004 and #3944). The widely circulated Chinese-language cross-evaluations on Zhihu document a similar pattern: when authors run a competitor’s pipeline themselves, they routinely cannot match the published number, and the competitors respond that “the web API version has more optimisations” not in the open-source release. This is, charitably, a feature for paying customers. Uncharitably, it is unfalsifiable benchmarking. The Zhihu cross-evaluations also note an under-discussed version-drift problem: “Zep upgraded from v2 to v3 after its official LoCoMo results were published. EverMind’s evaluation used Zep v3, which may explain why Zep scored higher in EverMind’s horizontal eval than in other vendors’ evals.” When the version of the system being benchmarked is itself a moving target, the leaderboard is no longer measuring algorithms; it is measuring release cadence.

6. The category sizes do not support statistical inference. Per-category Wilson-Score 95 % confidence intervals show that 56 % of adjacent-pair comparisons across published systems are statistically indistinguishable; the Open-Domain category, with n = 96, requires a 15-point gap to distinguish any two systems. Mem0 documents a multi-run methodology (10 independent seeds) and Zep’s corrected figure carries a ± 0.17 spread; most other systems report single-run point estimates. In a field where differences of one or two percentage points are routinely advertised as victories, this is, to put it gently, a category error in basic statistics.
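The Wilson score interval behind point 6 is short enough to compute yourself. The sketch below uses the standard formula; the category sizes (96 and 841) are LoCoMo's Open-Domain and Single-Hop counts, while the correct-answer counts are hypothetical. Treating non-overlapping intervals as "distinguishable" is a conservative rule of thumb and an assumption of this example, not necessarily the exact test used in the audit.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical systems at roughly 70 % accuracy on each category.
open_domain = wilson_interval(67, 96)
single_hop = wilson_interval(589, 841)
width_small = open_domain[1] - open_domain[0]
width_large = single_hop[1] - single_hop[0]
print(f"Open-Domain width: {100 * width_small:.1f} points")
print(f"Single-Hop width:  {100 * width_large:.1f} points")

# Two hypothetical systems about ten points apart on Open-Domain (n = 96).
system_a = wilson_interval(62, 96)
system_b = wilson_interval(72, 96)
print("ten-point gap overlaps:", intervals_overlap(system_a, system_b))
```

With n = 96, each interval is roughly eighteen points wide, so a ten-point gap between two systems still leaves their intervals overlapping.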

LoCoMo-Refined, an open project at mem-eval-suite/LoCoMo_refined, is the only serious attempt at a fix. It rewrites the judge prompt around the principle “inclusive without contradiction, complete without overreach”: answers must cover all required information, must not add unsupported content, and must align time information strictly. On 300 manually annotated samples, the new judge (Qwen3-14B with the refined prompt) achieved 86.33 % agreement with humans, against 43.67 % for the original LoCoMo setup. The team also cleaned 337 problematic samples — ambiguous wording, reversed subject-object relationships, time information inconsistent with the source conversation — releasing 1,382 questions in the refined dataset. Under the stricter rules, “the same system predictions score noticeably lower,” which is precisely what one would expect if the old benchmark had been overly forgiving. The project is, regrettably, almost unknown outside the Chinese-language research community for the prosaic reason that it has not yet had its arXiv moment in English. This is the sort of gap the field could probably fix in an afternoon.

The Six Reasons The Scores Are Not Real

The MemPalace story, the Mem0–Zep–Letta affair, and the LoCoMo audit are not separate scandals. They are three different views of the same underlying problem, which can be stated plainly in six points.

Reason 1: The LLM is doing most of the work — and you cannot tell how much

When MemPalace scored 100 % on LoCoMo, the actual retrieval step was bypassed; Claude Sonnet was reading every session and answering by reading comprehension. When Letta’s filesystem agent scored 74.0 % on LoCoMo, GPT-4o-mini was running grep iteratively. When MemoryBench tested A-Mem, Mem0, and MemoryOS, Tsinghua’s authors found that “none of the advanced memory-based LLMsys can consistently outperform RAG baselines that simply use all task context and feedback logs as retrieval corpus.” The “memory system” in many production stacks is, on inspection, a thin wrapper around an LLM call. The score is therefore dominated by the wrapper’s choice of base model and prompt, not by the architecture being marketed. Until ablations hold the model constant (and almost none do), comparing two memory systems on benchmark numbers is comparing two LLM configurations with a layer of XML in between.

Reason 2: LLM-as-judge has known, measured, severe biases — and is rarely audited

The standard practice across LoCoMo, LongMemEval, MemoryBench, and most newer benchmarks is to score with an LLM judge. Three well-documented biases corrupt this in practice:

  • Self-preference. Wataoka, Takahashi, and Ri (2024) found that “LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.” The practical consequence is that any memory system whose answers happen to look familiar to the judge — same model family, similar fine-tuning distribution, similar prose style — gets a measurable score bump that has nothing to do with whether the answer is correct. The judges, in effect, prefer their own dialect.
  • Length and vagueness bias. Verbosity bias (preferring longer answers) and the “vague-but-topical” bias documented by Penfield (62.81 % acceptance rate on intentionally wrong vague answers) systematically reward the wrong behaviour: paraphrasing the question back at the user rather than producing a specific recalled fact.
  • Position bias. When the same two answers are presented in reversed order, judges’ verdicts flip more often than is statistically tolerable; Shi et al.’s IJCNLP 2025 systematic study (2025) found that judge model choice had the single largest impact on positional bias of any variable tested.

The fix is mechanical and rarely applied: validate the judge adversarially with intentionally wrong answers, publish the judge prompt verbatim, and report the human-agreement rate. The LoCoMo-Refined project did this and showed a forty-point gap between the original setup (43.67 % human agreement) and the refined one (86.33 %). The other published scores are, by implication, running on the lower number.
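A judge audit of this kind needs very little machinery. In the sketch below, the two judges are toy stand-ins (keyword overlap and verbatim match) so that the example runs without an API; in a real audit each would wrap a call to the judge model with the published prompt. All names and test cases here are hypothetical.

```python
import re
from typing import Callable, Iterable

# A judge takes (question, gold answer, candidate answer) and returns a verdict.
Judge = Callable[[str, str, str], bool]


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lenient_judge(question: str, gold: str, candidate: str) -> bool:
    """Toy topical judge: accepts any overlap with a longer question word."""
    topical = {word for word in _tokens(question) if len(word) > 3}
    return bool(topical & _tokens(candidate))


def strict_judge(question: str, gold: str, candidate: str) -> bool:
    """Toy strict judge: the gold answer must appear verbatim."""
    return gold.lower() in candidate.lower()


def false_accept_rate(judge: Judge, wrong_cases: Iterable[tuple[str, str, str]]) -> float:
    """Share of intentionally wrong answers that the judge accepts."""
    verdicts = [judge(q, gold, cand) for q, gold, cand in wrong_cases]
    return sum(verdicts) / len(verdicts)


def human_agreement(judge_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Share of items where judge and human annotator give the same verdict."""
    pairs = list(zip(judge_verdicts, human_verdicts))
    return sum(j == h for j, h in pairs) / len(pairs)


# Hypothetical wrong-but-topical answers: (question, gold, wrong candidate).
wrong_cases = [
    ("What car did Alex buy?", "a red Ferrari", "Alex talked about buying a car."),
    ("When did Sam move to Berlin?", "March 2021", "Sam mentioned moving to Berlin at some point."),
    ("Which instrument does Priya play?", "cello", "Priya plays the violin."),
]

print("lenient:", false_accept_rate(lenient_judge, wrong_cases))
print("strict: ", false_accept_rate(strict_judge, wrong_cases))
```

Any false-accept rate well above zero on answers that are wrong by construction means the judge, not the memory system, is setting the score.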

Reason 3: The corpora fit in a context window

LongMemEval-S, with “approximately 115k tokens per problem” (Wu et al., 2025), fits inside any 200K-token frontier model. LoCoMo, at 16-26K tokens per conversation, fits in essentially any modern context window with room to spare for a chat about the weather. Mem0’s own paper showed full-context baselines beating Mem0. Mastra showed GPT-4o full-context scoring 60.2 % and observational memory scoring 84.23 % on LongMemEval-S, which means the benchmark is measuring how cleverly text is organised inside an already-sufficient context, not whether information can be recalled from outside it. A long-term memory benchmark whose entire corpus fits in a single prompt is, on the kindest reading, not measuring long-term memory. It is measuring chain-of-thought efficiency in a particularly inefficient way. Penfield’s audit puts the point unsparingly: “LongMemEval-S tests whether a model can locate information within 115K tokens. That is a useful capability to measure, but it is a context window test, not a memory test.”

Reason 4: Conflict of interest — the system authors are running the evaluations

The Mem0 paper benchmarked Zep with a misconfigured Zep harness, mapping both speakers to the user role and bypassing the SDK’s timestamp API. Zep’s response benchmarked Zep with a corrected harness, a Zep-favourable prompt, a single seed, and (as later acknowledged) an arithmetic error. Mem0’s response re-corrected the arithmetic. Letta benchmarked Letta, and Cognee, LangMem, A-Mem, MemoryOS, EverMemOS, Supermemory, ByteRover, OMEGA, Hindsight, SuperLocalMemory, ENGRAM, and now MemPalace have each benchmarked themselves. Each system author is the sole party with deep enough knowledge of their own system to configure it correctly, and the sole party with a strong incentive to misconfigure the competition. This is, to be quite clear, the configuration that science has been politely refusing to use since approximately 1665. The result is exactly as one would expect: every system reports being state of the art on the benchmark of its choice. The Zhihu cross-evaluations note that “only MemOS and EverMemOS have publicly disclosed complete horizontal evaluation methodologies,” which is to say that of the dozen-plus systems on the leaderboard, only two have published reproducible cross-evaluations against the others. The other ten or so are essentially advertisements.

Reason 5: Single-trial reporting with no statistical significance

Mem0’s paper used 10 independent runs and reported standard deviations. This is, somewhat embarrassingly for the field, the high end of current practice. Most published numbers — Zep’s original 84 % LoCoMo, Letta’s 74 %, MemPalace’s 96.6 %, EverMemOS’s 92.32 %, the various OMEGA numbers — are single-run point estimates. Given that LoCoMo’s per-category sample sizes range from 96 (Open-Domain) to 841 (Single-Hop), an 8.8x ratio, the Wilson-Score 95 % confidence intervals around these single points are remarkably wide. Even with ten clean reruns, Penfield calculates that Open-Domain remains 3.0x less precise than Single-Hop. None of the public score tables show error bars. The field is, in effect, reporting third-decimal-place precision on data that does not honestly support second-decimal-place precision. One could be forgiven, on reading these tables, for thinking they came from a sales deck.
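Multi-seed reporting costs a few lines of code. The sketch below is one way to do it, under stated assumptions: the per-seed scores are hypothetical, and the "gap larger than two combined standard errors" check is a rough rule of thumb chosen for this example rather than a formal significance test.

```python
import math
import statistics


def summarise_runs(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    return mean, std, std / math.sqrt(len(scores))


def gap_exceeds_noise(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """Rough check: is the gap in means larger than k combined standard errors?"""
    mean_a, _, sem_a = summarise_runs(a)
    mean_b, _, sem_b = summarise_runs(b)
    return abs(mean_a - mean_b) > k * math.hypot(sem_a, sem_b)


# Hypothetical per-seed accuracies (percent) for two systems.
system_a = [68.1, 68.6, 68.3, 68.9, 68.4]
system_b = [68.7, 68.2, 68.8, 68.5, 69.0]

for name, runs in (("A", system_a), ("B", system_b)):
    mean, std, _ = summarise_runs(runs)
    print(f"{name}: {mean:.2f} ± {std:.2f} over {len(runs)} seeds")
print("gap exceeds noise:", gap_exceeds_noise(system_a, system_b))
```

A single-run table would show B ahead of A on some seeds and behind on others; the summary makes it plain that the gap here is within the noise.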

Reason 6: Static benchmarks miss the actual job

This is the deepest critique, and the one the field is — finally, belatedly — starting to address. Static QA benchmarks ask: “Given this fixed log of past interactions, can the memory system retrieve the right answer to this question?” Real agent memory has to support a continuous loop: the agent acts, the user reacts, the system writes that reaction back into memory, and the next decision is conditioned on the updated state. Three recent papers attack this gap explicitly.

MemoryBench (Ai et al., 2025). The authors construct 20,000 cases across eleven datasets in three domains, four task formats, and two languages, with a user-feedback simulation framework. They benchmark A-Mem, Mem0, and MemoryOS against vanilla and SFT baselines on Qwen3-8B. The headline finding, on which the authors are entirely direct about contradicting prior published results: “none of the advanced memory-based LLMsys can consistently outperform RAG baselines that simply use all task context and feedback logs as retrieval corpus” — and previously published wins on LoCoMo “do not generalise” outside reading-comprehension settings. They are also “surprised to find that the time of existing memory-based methods used to memorize input context and conducting online inference are extremely long.” MemoryOS, per the paper, “exhibited better consistency in terms of inference time, but it took much longer time in memory construction (mostly longer than 17 seconds per case).” Mem0 could not finish two of their task types (“Open-Domain” and “LiSo”) because its memory pipeline could not process the input in reasonable time. Some methods, the paper drily notes, would have required “days or even weeks” to run on-policy across all datasets. A memory system that beats the leaderboard but cannot run the test in a working week is not, in practice, a memory system anyone is going to deploy.

AMemGym (Jiayang et al., 2026). This is the cleanest critique of off-policy evaluation in print. The authors note that all existing benchmarks “rely on static, off-policy data as context, limiting evaluation reliability and scalability.” Worse, “off-policy evaluation introduces reuse bias, undermining memory optimization and configuration selection.” In their direct comparison, the off-policy ranking of memory implementations disagrees with the on-policy ranking by up to three positions. Concretely, on the on-policy benchmark, the best agentic-write configuration scored 0.291 against 0.253 off-policy — and switching from off-policy to on-policy made the same configuration jump three ranks. The agents themselves degrade catastrophically: when ground-truth state is injected into context (the upper bound), all evaluated LLMs scored above 0.8; “however, as the interaction history grows with state updates, their performance drops sharply, with most models falling below 50 % of their upper bounds. Some models even perform no better than random guessing in later periods.” Claude Sonnet 4, the best model they tested, peaks at 0.336 on-policy. This is the field’s polite way of saying the kings, on closer inspection, are not wearing anything in particular.
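The rank-disagreement point can be made concrete with a small helper. In the sketch below, only the 0.253 and 0.291 scores for the agentic-write configuration come from the figures quoted above; the other configurations and their scores are hypothetical, invented so that the example reproduces a three-place jump. Ties are broken by insertion order, which is an arbitrary choice of this sketch.

```python
def ranking(scores: dict[str, float]) -> dict[str, int]:
    """Map each configuration to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(off_policy: dict[str, float], on_policy: dict[str, float]) -> dict[str, int]:
    """Places gained (positive) or lost (negative) when moving to on-policy."""
    off_rank, on_rank = ranking(off_policy), ranking(on_policy)
    return {name: off_rank[name] - on_rank[name] for name in off_rank}


# agentic_write uses the quoted scores; config_b..d are hypothetical.
off_policy = {"agentic_write": 0.253, "config_b": 0.300, "config_c": 0.280, "config_d": 0.270}
on_policy = {"agentic_write": 0.291, "config_b": 0.250, "config_c": 0.260, "config_d": 0.240}

shifts = rank_shifts(off_policy, on_policy)
print(shifts)
print("largest move:", max(abs(delta) for delta in shifts.values()))
```

If the configuration you would pick changes with the evaluation regime, the off-policy leaderboard is not a safe basis for choosing one.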

CASCADE (Guo et al., 2026). The third paper formalises “deployment-time learning” as a contextual bandit problem and shows, in the words of the abstract, that “across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9 % over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines” — implying, by simple arithmetic, that the memory baselines being beaten were leaving substantial room on the table that the static benchmarks had simply failed to surface.

Several other efforts (MEMTRACK from Patronus AI, MemoryArena, Mem2ActBench, MemGUI-Bench, LifeBench, ConvoMem, StoryBench) point in the same direction: when memory is evaluated in interactive, multi-session, action-conditional settings, the current SOTA falls off a cliff. Patronus’s MEMTRACK (Deshpande et al., 2025) reports that “the best performing GPT-5 model only achieves a 60 % Correctness score on MEMTRACK” — and the more elegant moral across the entire literature is that the “memory systems” beating each other by single points on LoCoMo are roughly indistinguishable from a no-memory baseline once you switch to a task where memory actually matters.

Notable Numbers, For The Record

Because the field’s reported numbers are unstable and the noise floor handsomely exceeds many of the gaps being claimed, I will not produce a ranked leaderboard. The following are widely cited published scores that any sensible reader should treat as version-dependent, vendor-contested, and prone to revision without notice:

  • Mem0 on LoCoMo: ~66.9 % base; ~68.5 % graph variant (Mem0 paper, 2025).
  • Zep on LoCoMo: 65.99 % (as reported in the Mem0 paper) → 75.14 % ± 0.17 (corrected, 10 runs, by Zep in 2025) → 58.44 % ± 0.20 (re-corrected by Mem0 on Zep’s own pipeline). Pick your preferred reality.
  • Letta on LoCoMo: 74.0 % using a filesystem and grep only, GPT-4o-mini (Letta, August 2025).
  • MemPalace on LoCoMo: 60.3 % R@10 in raw mode, no LLM rerank (per the project’s own honest documentation); the headline 100 % is invalid for the reasons set out above.
  • Zep on LongMemEval-S: 71.2 % with GPT-4o (Zep 2025); also reported 63.8 % on the temporal slice elsewhere.
  • Mastra OM on LongMemEval-S: 94.87 % (vendor-reported, GPT-5-mini actor), against ~60.2 % for GPT-4o full-context. The gap reflects context-window management, not memory retrieval.
  • EverMemOS on LoCoMo: 92.32 % overall; third-party reproduction on issue #73 reported 38.38 %.
  • MemPalace on LongMemEval-S: 96.6 % R@5 raw mode — but this is recall on labelled session IDs, not the leaderboard’s QA-judge metric; the hand-tuned 100 % is invalid.
  • A-Mem, Mem0, MemoryOS on MemoryBench: none consistently beats the RAG-with-all-context baseline (Ai et al., 2025).
  • GPT-5 on MEMTRACK: 60 % Correctness, the ceiling among all tested models (Deshpande et al., 2025).
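
The two MemPalace entries above report recall at k on labelled session IDs, which is a different quantity from judged answer accuracy. A minimal sketch of one common definition of recall@k follows; I am not claiming it is the exact formula MemPalace uses, and the session IDs are made up.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold session IDs that appear among the top-k retrieved IDs."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


# Hypothetical retrieval result for one question whose evidence lives in two sessions.
retrieved = ["s7", "s2", "s9", "s4", "s1", "s3"]
gold = {"s2", "s3"}

print(recall_at_k(retrieved, gold, k=5))   # 0.5
print(recall_at_k(retrieved, gold, k=10))  # 1.0
```

Nothing in this metric looks at the answer the system eventually gives. A pipeline can retrieve every gold session and still answer wrongly, which is why a retrieval score cannot be set beside a QA-judge score as if they were the same column.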

The instructive observation is the spread. On LoCoMo alone, Mem0’s paper puts Zep at 65.99 %, Mem0’s re-run of Zep’s own pipeline puts it at 58.44 %, Zep’s own (corrected) blog post puts it at 75.14 %, Letta puts a filesystem at 74.0 %, the full-context baseline runs around 73 %, MemPalace’s honest score is 60.3 %, and EverMemOS claims 92.32 %, which a third party reproduces at 38.38 %. That is a range of 54 percentage points, on the same benchmark, with the same dataset, within roughly the same year, attributable not to algorithms but to evaluation choices. A field in which the same system can score 38 % or 92 % depending on who is running the test is not a field with reliable benchmarks. It is a field with very enterprising marketing departments.
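
The 54-point figure is just the maximum minus the minimum of the LoCoMo numbers listed in this section. A short sketch of the arithmetic, using only those values (the dictionary labels are mine, and the full-context entry is the approximate 73 % mentioned in the text):

```python
# LoCoMo scores as cited in this section, in percent.
locomo_scores = {
    "Zep (per Mem0 paper)": 65.99,
    "Zep (Mem0 re-run of Zep pipeline)": 58.44,
    "Zep (own corrected post)": 75.14,
    "Letta filesystem + grep": 74.0,
    "Full-context baseline (approx.)": 73.0,
    "MemPalace raw R@10": 60.3,  # a retrieval metric, listed here as in the text
    "EverMemOS (claimed)": 92.32,
    "EverMemOS (third-party reproduction)": 38.38,
}

highest = max(locomo_scores, key=locomo_scores.get)
lowest = min(locomo_scores, key=locomo_scores.get)
spread = locomo_scores[highest] - locomo_scores[lowest]

print(f"{highest} vs {lowest}: {spread:.2f} points")
```

Both ends of the range belong to the same system, which is rather the point.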

What The Dynamic Benchmarks Actually Test

A short note on what a properly-shaped LTM benchmark looks like, because the next round of leaderboards will inevitably use these and the public reading them deserves a chance to understand the difference.

On-policy means the evaluation conversation is generated during the test, by interaction with the system being evaluated, rather than replayed from a fixed transcript. This matters because the agent’s own behaviour shapes what the user says next, which shapes what the memory system has to store, which shapes the next decision. Off-policy evaluation on a fixed transcript treats memory as a retrieval task; on-policy evaluation treats it, more honestly, as a control loop.
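
The difference between replaying a transcript and running a control loop is easiest to see in code. The following is a toy sketch under my own assumptions: the `Agent` and `UserSim` type aliases, the two runner functions, and the toy agent and user are all hypothetical and bear no relation to any benchmark's actual harness.

```python
from typing import Callable

Agent = Callable[[list[str]], str]    # conversation history -> agent reply
UserSim = Callable[[list[str]], str]  # conversation history -> next user message


def run_off_policy(agent: Agent, transcript: list[str]) -> list[str]:
    """Replay fixed user turns; the agent's replies never affect what comes next."""
    history: list[str] = []
    for user_msg in transcript:
        history.append(f"user: {user_msg}")
        history.append(f"agent: {agent(history)}")
    return history


def run_on_policy(agent: Agent, user_sim: UserSim, turns: int) -> list[str]:
    """Generate each user turn from the live history, so agent behaviour shapes it."""
    history: list[str] = []
    for _ in range(turns):
        history.append(f"user: {user_sim(history)}")
        history.append(f"agent: {agent(history)}")
    return history


def toy_agent(history: list[str]) -> str:
    # Asks a clarifying question on the first turn, then just acknowledges.
    return "Which city?" if len(history) == 1 else "Noted."


def toy_user(history: list[str]) -> str:
    if not history:
        return "Book me a trip."
    # The simulated user answers only if the agent actually asked something.
    return "Lisbon." if history[-1].endswith("?") else "Thanks."


off = run_off_policy(toy_agent, ["Book me a trip.", "Thanks."])
on = run_on_policy(toy_agent, toy_user, turns=2)
print(off[2])  # user: Thanks.
print(on[2])   # user: Lisbon.
```

In the replayed run the user ignores the agent's question, so the fact worth remembering never enters the conversation at all. In the on-policy run it does, and only because the agent asked.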

State-grounded means each user has a structured underlying state (preferences, facts about their life, ongoing tasks), and the user simulator’s job is to reveal portions of that state through natural conversation, not merely to ask questions about it. AMemGym is the cleanest example. Its simulator pulls personas from NVIDIA’s Nemotron-Personas (100K personas), samples question schemas, generates state-evolution trajectories with narrative life events, and writes ground-truth answers for each (question, state) pair. The metric is whether the assistant produces the right personalised answer given everything the user has gradually exposed across the conversation. The role-play user simulator (GPT-4.1) is meta-evaluated against humans at Gwet’s AC1 >= 0.96 — a level of judge-validation almost no static benchmark approaches.
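
To make "state-grounded" less abstract, here is a toy data structure of my own devising. It is not AMemGym's implementation: the `UserState` class, its methods, and the `ground_truth` helper are hypothetical, and the persona facts are invented. The sketch only shows the shape of the idea, namely a hidden state that is revealed piecemeal, changes over time, and defines the correct answer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserState:
    """Hidden structured state that a simulator reveals gradually."""
    facts: dict[str, str]
    revealed: set[str] = field(default_factory=set)

    def reveal(self, key: str) -> str:
        """Produce an utterance exposing one fact and record that it was exposed."""
        self.revealed.add(key)
        return f"By the way, my {key} is {self.facts[key]}."

    def update(self, key: str, value: str) -> None:
        """A life event changes a fact; the new value counts as unknown until revealed."""
        self.facts[key] = value
        self.revealed.discard(key)


def ground_truth(state: UserState, key: str) -> Optional[str]:
    """Expected answer: the current value if the user has exposed it, else None."""
    return state.facts[key] if key in state.revealed else None


state = UserState({"city": "Porto", "diet": "vegetarian"})
print(state.reveal("city"))         # By the way, my city is Porto.
print(ground_truth(state, "city"))  # Porto
print(ground_truth(state, "diet"))  # None

state.update("city", "Lisbon")      # the user moves
state.reveal("city")
print(ground_truth(state, "city"))  # Lisbon
```

A memory system that still answers "Porto" after the move has retrieved a real fact and given the wrong answer, which is exactly the failure a static question-answering set struggles to expose.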

Long-horizon means more than a single session, and ideally enough turns that the test exceeds any plausible context window. AMemGym’s “extra” configuration requires 512K+ tokens of context, putting it well beyond the comfortable range of any current frontier model.

Action-conditional means the agent is actually doing something — taking actions in an environment with consequences — and the memory has to support those actions rather than merely answer questions about the past. SWE-Bench, WebArena, and Patronus’s MEMTRACK fall in this family. MEMTRACK’s GPT-5 ceiling of 60 % correctness is, in my reading, the single most important data point in the entire LTM literature: the best model in the world, on a memory-and-state-tracking task across realistic enterprise tools (Slack, Linear, Git), is at 60 %. The marketing material, by curious coincidence, almost never mentions this.

Once you start using benchmarks of this shape, the entire current LTM leaderboard collapses into noise. The systems are not being tested on the thing they are being sold to do.

Recommendations

For practitioners actually choosing an LTM system today, here is what I would do. These are staged so you can stop at the first one that gives you a clear answer.

Stage 1 — now. Ignore the leaderboard. Build a full-context baseline and a filesystem-with-grep baseline on a sample of your data. If your task fits in a 200K-token context, the full-context baseline will be embarrassingly competitive, and you will have just saved yourself a vendor relationship. If grep matches your vendor’s number within five points, your vendor is selling you grep with a logo on it. Threshold to escalate: a memory system needs to beat full-context plus grep by at least ten points on your own task before it justifies its existence on your stack.
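
A filesystem-with-grep baseline needs very little machinery. The sketch below is one minimal way to build it, assuming sessions are stored as plain `.txt` files, one utterance per line; the function names, the file layout, and the all-words-must-match rule are my assumptions, and I read the ten-point threshold as "beat the better of the two baselines by ten points".

```python
import re
import tempfile
from pathlib import Path


def grep_sessions(root: Path, query: str, max_hits: int = 5) -> list[str]:
    """Return lines from *.txt files under root containing every query word."""
    words = [w.lower() for w in re.findall(r"\w+", query)]
    hits: list[str] = []
    for path in sorted(root.rglob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            lowered = line.lower()
            if all(word in lowered for word in words):
                hits.append(f"{path.name}: {line}")
                if len(hits) == max_hits:
                    return hits
    return hits


def worth_escalating(memory_score: float, full_context_score: float,
                     grep_score: float, margin: float = 10.0) -> bool:
    """True only if the memory system beats the better baseline by the margin."""
    return memory_score - max(full_context_score, grep_score) >= margin


# Tiny demonstration on two made-up session files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "session_01.txt").write_text("user: my dog is called Biscuit\n", encoding="utf-8")
    (root / "session_02.txt").write_text("user: I moved to Lisbon in March\n", encoding="utf-8")
    demo_hits = grep_sessions(root, "dog called")

print(demo_hits)                           # ['session_01.txt: user: my dog is called Biscuit']
print(worth_escalating(78.0, 73.0, 74.0))  # False
print(worth_escalating(85.0, 73.0, 74.0))  # True
```

The retrieved lines would then be pasted into the prompt of whatever model you already use, and the result scored with the same judge as the vendor system. The scores passed to `worth_escalating` here are placeholders, not measurements.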

Stage 2 — next quarter. If you do actually need a memory layer, run two independent evaluations using two different judge models with two different (published, adversarially-validated) prompts. Demand multi-seed numbers — at least ten runs — and ask the vendor for the standard deviation. If they have not measured it, they do not know what they are selling. Threshold to abandon: if the variance across seeds is larger than the gap to your second-choice vendor, you are choosing on noise, which is to say you are not really choosing.
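
The abandonment threshold can be checked mechanically once you have per-seed scores. This is a minimal sketch under my own assumptions: I use the sample standard deviation as the measure of seed-to-seed variation (so that it is in the same units as the gap), I compare the larger of the two deviations against the gap in means, and the ten scores per vendor are hypothetical.

```python
from statistics import mean, stdev


def choosing_on_noise(first: list[float], second: list[float]) -> bool:
    """True if seed-to-seed spread exceeds the gap between the two mean scores."""
    gap = abs(mean(first) - mean(second))
    spread = max(stdev(first), stdev(second))
    return spread > gap


# Hypothetical accuracy (percent) over ten seeds for two candidate vendors.
vendor_a = [71.0, 74.5, 69.5, 73.0, 72.0, 70.5, 75.0, 68.5, 72.5, 73.5]
vendor_b = [70.0, 72.5, 69.0, 71.5, 73.0, 68.5, 71.0, 72.0, 70.5, 72.0]

print(f"A: {mean(vendor_a):.1f} +/- {stdev(vendor_a):.2f}")
print(f"B: {mean(vendor_b):.1f} +/- {stdev(vendor_b):.2f}")
print(choosing_on_noise(vendor_a, vendor_b))  # True
```

A one-point lead with a two-point standard deviation is not a lead. A stricter reader might prefer a proper significance test, but this crude comparison already disposes of most single-run claims.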

Stage 3 — when the benchmarks improve. Track MemoryBench, AMemGym, MEMTRACK, ConvoMem, and LoCoMo-Refined. When two or more of these become the standard reporting metric and the vendor leaderboards converge on the on-policy numbers, the field will have had its honest reckoning. Until then, the direction of progress on dynamic benchmarks is informative, but the absolute numbers on static benchmarks are decorative.

Stage 4 — strategic. If you are building an agent product whose value genuinely depends on memory, the most defensible position is to own your evaluation. Build a small, private, adversarially-judged, on-policy benchmark over your real use cases and update it monthly. Use it as the gate for any vendor swap. This is the same lesson the search-quality teams learned in the late 2000s; the LTM field is, on a generous reading, about ten years behind that curve.

Decision rules. Use a managed memory service (Mem0, Zep cloud) when the operational simplicity is worth more than the benchmark uncertainty — early prototypes, customer-support bots, simple personalisation. Use the agent-managed-memory pattern (Letta, MemGPT-style) when the agent will run long enough that the model’s own tool use will outperform any external memory abstraction. Use local-first stores (Cognee, ChromaDB-on-disk, MemPalace in its honest mode) when privacy or cost dominates. Treat all benchmark deltas under ten points as noise, and any vendor unwilling to publish their judge prompt as a polite no.

Caveats

  • I have skin in the game. Every author writing on this subject does. I am writing from the perspective of someone who has shipped memory-augmented agents and who has, on more than one occasion, been the person who chose between Mem0 and Zep and Letta and ended up rolling their own. Treat my preference for adversarially-validated, on-policy evaluation with the appropriate suspicion of someone who has been comprehensively burned by the alternative.
  • The honest core of several projects is still genuinely interesting. MemPalace’s raw verbatim retrieval beating LLM-extraction approaches on session retrieval is, I think, a real and useful negative result. Zep’s temporal knowledge graph does seem to help on knowledge-update questions. Letta’s filesystem result is the most important single methodological data point of 2025. Mem0 is, by some distance, the most operationally polished managed service. The critique here is not of the engineering. It is of the marketing’s relationship to the engineering.
  • The benchmarks are improving in real time. LoCoMo-Refined exists. MemoryBench, AMemGym, MEMTRACK, and CASCADE are all from the last twelve months. The field is in the middle of correcting itself, which is to its credit. The numbers in this article are accurate to mid-May 2026 and will be stale within a quarter; the structural critique should age rather better than the specific scores.
  • The Penfield Labs critique is partisan. Penfield maintains a competing memory project, and has a natural axe to grind. I have cross-checked their numerical claims (99 errors in 1,540 questions; 62.81 % judge acceptance of vague wrong answers; the SHA256-verified audit repository at dial481/locomo-audit) and they hold up; the audit dataset is open and independently inspectable. But the editorial framing in their Substack post is, like Zep’s “Lies, Damn Lies” post and Mem0’s GitHub-issue rebuttal, a vendor critique of another vendor. The field’s structural problem is that this remains the only kind of public critique it routinely receives.
  • Reproducibility is now an explicitly political variable. Multiple systems have responded to public reproduction failures with a polite shrug and the explanation that “the cloud / paid version has additional optimisations.” This is a perfectly reasonable business model and a perfectly disastrous epistemic practice. If you are evaluating any LTM system, demand that the cloud version be benchmarked against the open-source version on the same task, and demand to see the gap. Where this gap is hidden, assume it is large.
  • The MemPalace authors deserve the partial credit they have taken. Of all the parties in this saga, the MemPalace team is the one whose public response to scrutiny was the most rapid and the most candid. Sigman and Jovovich pushed a corrected README within forty-eight hours, listed every error, and thanked the issue filers by name. Their headline numbers were inflated; their post-mortem behaviour was, by the field’s standards, exemplary. That this is worth pointing out says rather more about the field than it does about MemPalace.

Further Reading

For readers who would like to follow this research direction more systematically (the products, tutorials, surveys, benchmarks, and papers that actually define the agent-memory field), I maintain a curated, regularly updated list with collaborators at Awesome-Agent-Memory. It is where I would point anyone trying to get oriented quickly. Pull requests are, as ever, very welcome.

References

Ai, Q., Tang, Y., Wang, C., Long, J., Su, W., & Liu, Y. (2025). MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems. https://arxiv.org/abs/2510.17281
Chalef, D. (2025). Lies, Damn Lies, & Statistics: Is Mem0 Really SOTA in Agent Memory? Zep blog post. https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/
Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. https://arxiv.org/abs/2504.19413
Deshpande, D., Gangal, V., Mehta, H., Kannappan, A., Qian, R., & Wang, P. (2025, October). MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments. NeurIPS 2025 SEA Workshop. https://arxiv.org/abs/2510.01353
Guo, S., Du, Y., Chen, H., Chang, Y., & Wang, J. (2026). CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment. https://arxiv.org/abs/2605.06702
Jiayang, C., Ru, D., Qiu, L., Li, Y., Cao, X., Song, Y., & Cai, X. (2026, March). AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2603.01966
Letta Team. (2025). Benchmarking AI Agent Memory: Is a Filesystem All You Need? Letta blog post. https://www.letta.com/blog/benchmarking-ai-agent-memory
Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of LLM agents. Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2402.17753
MemPalace contributors. (2026). MemPalace: Local-First AI Memory. GitHub repository. https://github.com/MemPalace/mempalace
Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2103.14749
Penfield Labs. (2026). We Audited LoCoMo: 6.4% of the Answer Key is Wrong. Reddit, r/MachineLearning. https://www.reddit.com/r/MachineLearning/comments/1s54cvg/d_we_audited_locomo_64_of_the_answer_key_is_wrong/
Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., & Vosoughi, S. (2025, December). Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL). https://doi.org/10.18653/v1/2025.ijcnlp-long.18
Wataoka, K., Takahashi, T., & Ri, R. (2024, October). Self-preference bias in LLM-as-a-Judge. NeurIPS 2024 Safe Generative AI Workshop. https://arxiv.org/abs/2410.21819
Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2410.10813
Yadav, D. (2025). Revisiting Zep’s 84% LoCoMo Claim: Corrected Evaluation & 58.44% Accuracy. GitHub issue, getzep/zep-papers #5. https://github.com/getzep/zep-papers/issues/5