Fair point on the title. I've emailed the mods to see if they can update it.
I don't think it's anthropomorphizing to study how vocabulary constraints change reasoning quality. The paper doesn't claim LLMs think. It measures accuracy on tasks with known correct answers under different constraints and finds structured patterns.
"Limiting output changes output" is true but undersells what's happening. If you removed random words from a calculator's input language you'd expect degraded or noisy results. Instead, removing possessive "to have" (so the model can't say "the argument has a flaw" and has to say "the argument fails because...") improves ethical reasoning by 19pp across all three models. Removing "to be" helps Gemini by 42pp on that same task but collapses GPT-4o-mini by 27pp on a different one. The cross-model correlation is r=-0.75, meaning the same restriction systematically helps one model and hurts another.
That's not just different output. The restrictions are forcing different reasoning paths depending on the task and the model. Why specific vocabulary removals produce specific, predictable accuracy changes is the question. Running a 15,600-trial follow-up now to dig into it further.
I ran 4,470 trials across three language models (Claude Haiku, GPT-4o-mini, Gemini Flash Lite) on seven reasoning tasks, constraining them to write in E-Prime (no "to be") or without possessive "to have." The constraints don't uniformly help — they reshape reasoning in task-specific and model-specific ways.
Key findings:
- No-Have improves ethical reasoning by 19pp (p<0.001) and epistemic calibration by 7.4pp across all models
- E-Prime improves Gemini's ethical reasoning by 42pp but collapses GPT-4o-mini's epistemic calibration by 27pp
- Cross-model correlations reach r=-0.75 — the same constraint helps one model and hurts another
- A 3-agent ensemble using linguistically diverse constraints hits 100% coverage on debugging problems vs 88% for the unconstrained control
The idea: for an LLM, language isn't a medium through which cognition passes — it IS the cognition. Designing the vocabulary an agent reasons in is a distinct engineering discipline from prompt or context engineering. I call it "Umwelt engineering" after Jakob von Uexküll's concept of an organism's perceptual world.
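For concreteness, here's a minimal sketch of how an E-Prime constraint could be enforced mechanically. This is a hypothetical checker, not the harness used in the experiments; the word list and function name are illustrative:

```python
import re

# Forms of "to be" that E-Prime forbids, including common contractions.
# Illustrative list only -- the actual experimental filter may differ.
BE_FORMS = (
    r"\b(am|is|are|was|were|be|been|being|"
    r"isn't|aren't|wasn't|weren't)\b"
)

def violates_eprime(text: str) -> bool:
    """Return True if the text contains any form of 'to be'."""
    return re.search(BE_FORMS, text, flags=re.IGNORECASE) is not None
```

A constrained agent loop would reject or regenerate any output where this returns True, which is what forces the model off its default phrasings ("the argument is flawed") onto causal ones ("the argument fails because...").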
I'm one of those zero-star repos. I've been using Claude Code for some weeks now and built a personal knowledge graph with a reasoning engine, belief revision, link prediction. None of it is designed for stars, it's designed for me. The repo exists because git is the right tool for versioning a system that evolves every day.
The framing assumes GitHub repos are supposed to be products.
Wait a minute! Ha, just saw this. The knowledge graph I mentioned is a separate project (heartwood on my profile). Different angle from propstore, but I think we're circling the same problem: conflicting claims that shouldn't be silently resolved. Added my email to my profile now.
I've been building persistent memory for Claude Code too, narrower focus though: the AI's model of the user specifically. Different goal but I kept hitting what I think is a universal problem with long-lived memory. Not all stored information is equally reliable and nothing degrades gracefully.
An observation from 30 sessions ago and a guess from one offhand remark just sit at the same level. So I started tagging beliefs with confidence scores and timestamps, and decaying ones that haven't been reinforced. The most useful piece ended up being a contradictions log where conflicting observations both stay on the record. Default status: unresolved.
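A minimal sketch of that scheme. The class names, decay constants, and reinforcement boost are illustrative, not the actual implementation:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Belief:
    claim: str
    confidence: float       # 0..1, initial trust in the observation
    last_reinforced: float  # unix timestamp of the last confirmation

    def reinforce(self, boost: float = 0.1) -> None:
        # A repeated observation bumps confidence and resets the clock.
        self.confidence = min(1.0, self.confidence + boost)
        self.last_reinforced = time.time()

    def decayed_confidence(self, half_life_days: float = 30.0) -> float:
        # Unreinforced beliefs lose weight exponentially with age.
        age_days = (time.time() - self.last_reinforced) / 86400
        return self.confidence * 0.5 ** (age_days / half_life_days)

@dataclass
class ContradictionLog:
    entries: list = field(default_factory=list)

    def record(self, a: Belief, b: Belief) -> None:
        # Both sides stay on the record; nothing gets silently resolved.
        self.entries.append({"a": a.claim, "b": b.claim,
                             "status": "unresolved"})
```

The key design choice is that decay never deletes anything: a stale belief just carries less weight at retrieval time, and conflicts land in the log rather than overwriting each other.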
Tiered loading is smart for retrieval. Curious if you've thought about the confidence problem on top of it, like when something in warm memory goes stale or conflicts with something newer.
This is really interesting. At this point you seem to be modelling real human memory.
In my opinion, this should happen inside the LLM directly. Trying to scaffold it on top of the next-token predictor isn't going to be fruitful enough. It won't get us the robot butlers we need.
But obviously that's really hard. That needs proper ML research, not prompt engineering.
Personally, I think the mechanics of memory can be universal, but the "memory structure" needs to be customized by each user individually. What gets memorized and how should be tied directly to the types of tasks being solved and the specific traits of the user.
Big corporations can only really build a "giant bucket" and dump everything into it. BUT what needs to be remembered in a conversation with a housewife vs. a programmer vs. a tourist are completely different things.
True usability will inevitably come down to personalized, purpose-driven memory. Big tech companies either have to categorize all possible tasks into a massive list and build a specific memory structure for each one, or just rely on "randomness" and "chaos".
Building the underlying mechanics but handing the "control panel" over to the user—now that would be killer.
You're probably right long term. If LLMs eventually handle memory natively with confidence and decay built in, scaffolding like this becomes unnecessary. But right now they don't, and the gap between "stores everything flat" and "models you with any epistemological rigor" is pretty wide. This is a patch for the meantime.
The other thing is that even if the model handles memory internally, you probably still want the beliefs to be inspectable and editable by the user. A hidden internal model of who you are is exactly the problem I was trying to solve. Transparency might need to stay in the scaffold layer regardless.
The observations layer being append-only is smart, that's basically the same instinct as the tensions log. The raw data stays honest even when the interpretation changes.
The freshness approach and explicit confidence scores probably complement each other more than they compete. Freshness tells you when something was last touched, confidence tells you how much weight it deserved in the first place. A belief you inferred once three months ago should decay differently than one you confirmed across 20 sessions three months ago. Both are stale by timestamp but they're not the same kind of stale.
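One way to express that distinction is to let the confirmation count stretch the half-life, so two beliefs with the same timestamp decay at different rates. The formula and constants here are purely illustrative:

```python
import math

def effective_weight(base_confidence: float, confirmations: int,
                     age_days: float, half_life_days: float = 90.0) -> float:
    """Decayed weight of a belief; confirmations slow the decay.

    A belief confirmed across many sessions keeps more weight at the
    same age than a one-off inference with the same timestamp.
    """
    effective_half_life = half_life_days * (1 + math.log1p(confirmations))
    return base_confidence * 0.5 ** (age_days / effective_half_life)

# Same timestamp (90 days old), same starting confidence, different histories:
once_inferred  = effective_weight(0.6, confirmations=1,  age_days=90)
well_confirmed = effective_weight(0.6, confirmations=20, age_days=90)
```

Under this sketch `well_confirmed` retains noticeably more weight than `once_inferred` even though both are equally "stale" by timestamp, which captures the two kinds of stale in one number.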