-181 vs DeepSeek BPE, -169 vs Llama BPE (p<0.001)

Phase EPSILON closes the methodological caveat from Phase DELTA (all kinds shared word-level vocab) by running the three SOTA training recipes on their native BPE 32K tokenization while keeping MASHA v0.3 on word-level. Comparison is on BPC (tokenization-fair).

Headline (n=6, df=5, all paired):

- Δ masha_v3 - baseline_bpe: μ=-0.3528, σ=0.0053, t=-163.57
- Δ masha_v3 - deepseek_bpe: μ=-0.3701, σ=0.0050, t=-181.13
- Δ masha_v3 - llama_bpe: μ=-0.3525, σ=0.0051, t=-169.50

All three at two-tailed p<0.001 (crit |t|>6.87). 6/6 seeds favourable. The 0.35 BPC gap means MASHA encodes the same held-out text using ~14% fewer bits per character than any of the three BPE baselines, with the same compute and the same data.

Cross-phase summary (six paired-t tests, all p<0.001):

- vanilla word (DELTA): t=-10.46
- DeepSeek word (DELTA): t=-39.40
- Llama word (DELTA): t=-7.13
- vanilla BPE (EPSILON): t=-163.57
- DeepSeek BPE (EPSILON): t=-181.13
- Llama BPE (EPSILON): t=-169.50

This is the strongest single result of the MASHA campaign so far.

What remains:

- Phase GAMMA: full Wikipedia PT (~1.5B tokens, scale test)
- Phase BETA: sense-aware Stream B (polysemy)
- Cloud robustness (issue #23)
MASHA
Modelo de Arquitetura Semântica Hierárquica Avançada (Advanced Hierarchical Semantic Architecture Model), a neuro-symbolic Portuguese language model that treats language as a hierarchical chemical system: characters → morphemes → words → phrases.
Sovereign AI begins with understanding one's own language.
Project state — 2026-05-13 (v0.3 refactor in progress)
The architecture is in the third refactor. Each iteration was triggered by a concrete empirical finding from the prior version:
| Version | Architecture | Outcome that motivated the next iteration |
|---|---|---|
| v0.1 | Pure baseline (vanilla decoder Transformer, no priors) | Reference point. Established the paired-t methodology and the ablation rig. |
| v0.2 | Morphemic + grammar priors summed into the input embedding before the Transformer | Local paired-t n=6 mixed-objective showed Δabc = −41 PPL (two-tailed p<0.05). A cloud expedition on an H200 NVL exposed numerical fragility — the sign of the effect inverted under cu130 vs cu128. See docs/CLOUD_EXPEDITION_REPORT.md. |
| v0.3 (current) | Two separated streams: distributional LLM + symbolic prior, fused at logit level by a learned gate | Performance floor by construction: when the gate goes to 1, MASHA reduces exactly to the baseline, so the best loss v0.3 can reach is no worse than the baseline's. See docs/MASHA_V3_DESIGN.md. |
Today's success criterion (the one we actually care about):
A MASHA model learning Wikipedia PT must outperform a parameter-matched LLM trained with state-of-the-art recipes (DeepSeek-style, Llama-style, GPT-style) on the same Wikipedia PT, under the same compute budget, on the same paired seeds.
That is what v0.3 is built to test. The comparison is apples-to-apples: same machine (A5000 16 GB), same data, same gradient steps, same eval rig. The paired-t framework of Steps A/B/C (Passos A/B/C in the docs) already implements this; v0.3 only changes the architecture under test, and the ablation rig is being extended with SOTA training-recipe baselines (not just a vanilla Transformer).
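The paired-t comparison on shared seeds can be sketched as follows. This is a minimal illustration, assuming each kind produces one metric value (PPL or BPC) per seed; the function name `paired_t` and the numbers are hypothetical and are not taken from the repo's rig or from measured runs.

```python
import math
import statistics


def paired_t(candidate, baseline):
    """Return (mean_diff, sd_diff, t, df) for per-seed paired results."""
    if len(candidate) != len(baseline) or len(candidate) < 2:
        raise ValueError("need two equally long lists with at least 2 seeds")
    # One difference per seed: candidate minus baseline on the same seed.
    diffs = [c - b for c, b in zip(candidate, baseline)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample std (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t, n - 1


# Illustrative values only (not measured results): one metric per seed.
masha = [1.00, 1.10, 0.90, 1.00, 1.05, 0.95]
base = [1.30, 1.45, 1.20, 1.35, 1.40, 1.30]
mean_diff, sd_diff, t, df = paired_t(masha, base)
print(f"mean diff={mean_diff:.4f} sd={sd_diff:.4f} t={t:.2f} df={df}")
```

For a lower-is-better metric, a negative `t` favours the candidate. The statistic is then compared against the two-tailed critical value for `df = n - 1`.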
v0.3 in one diagram
Stream A — Distributional LLM (bit-identical to baseline)
word_ids → token_embed → 12-layer Transformer → lm_head → logits_A [V]
Stream B — Symbolic prior (NEW)
word_ids → lookup(root, prefix, suffix, POS) → MLP → logits_B [V]
(uses morphemic + grammar features, NOT the token id itself)
Gate (NEW)
α = sigmoid(linear(h_A)) ∈ [0, 1] per position
initialized so α ≈ 0.95 at step 0 (model behaves like baseline initially)
Fusion at logits
logits = α · logits_A + (1 − α) · logits_B
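The architectural property v0.2 lacked: when α → 1 everywhere, logits → logits_A = baseline(x), so the baseline is a limiting case of v0.3. The gradient is free to drive α toward 1 wherever Stream B is noise, which means the best loss v0.3 can reach is no worse than the baseline's. This is a statement about what the model can represent; whether a given training run gets there is what the paired-t tests measure.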
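The fusion rule and the gate initialisation can be sketched in plain Python for a single position. This is a sketch under assumptions: the gate weights are taken to start at zero and the bias is set to logit(0.95) so that α ≈ 0.95 at step 0. The helper names `gate_bias_for` and `fuse` are hypothetical, and the code in masha/model/masha_v3.py may be organised differently.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_bias_for(alpha0):
    """Bias b such that sigmoid(b) == alpha0 when the gate weights are zero."""
    return math.log(alpha0 / (1.0 - alpha0))


def fuse(logits_a, logits_b, h_a, w, b):
    """logits = alpha * logits_A + (1 - alpha) * logits_B for one position."""
    alpha = sigmoid(sum(wi * hi for wi, hi in zip(w, h_a)) + b)
    fused = [alpha * a + (1.0 - alpha) * s for a, s in zip(logits_a, logits_b)]
    return alpha, fused


h_a = [0.2, -0.1, 0.4]        # toy hidden state from Stream A
w = [0.0, 0.0, 0.0]           # assumed zero-initialised gate weights
b = gate_bias_for(0.95)       # log(0.95 / 0.05) = log(19)
logits_a = [2.0, 0.5, -1.0]   # toy Stream A logits over a 3-word vocab
logits_b = [0.0, 1.5, 0.0]    # toy Stream B logits over the same vocab

alpha, fused = fuse(logits_a, logits_b, h_a, w, b)
# A very large bias saturates the gate, so the fused logits approach logits_A.
alpha_sat, fused_sat = fuse(logits_a, logits_b, h_a, w, 40.0)
print(alpha, fused)
print(alpha_sat, fused_sat)
```

The second call illustrates the limiting case: once the gate saturates, Stream B no longer contributes and the output matches Stream A.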
Where we are this week
- ✅ v0.3 design doc: `docs/MASHA_V3_DESIGN.md`
- ✅ v0.3 implementation: `masha/model/masha_v3.py` (with `SymbolicStream`, gate, fused forward)
- ✅ Training-loop integration: `scripts/run_ablation_word.py` ships a fourth kind `masha_v3` alongside `baseline` / `masha_ab` / `masha_abc`
- ✅ Smoke test: 200 steps × seed 41, v0.3 produced lower PPL and higher top-1 than the baseline on the first run (lower-bound guarantee respected; sign already favourable at init)
- 🔄 n=3 paired-t at local scale (256 K tokens/cell)
- ⏳ n=6 mixed-objective paired-t (the regime where v0.2 locally crossed p<0.05)
- ⏳ Phase BETA: sense-aware Stream B (polysemy, literal vs figurative — see design doc)
- ⏳ Phase GAMMA: expand training corpus from 40 MB Wikipedia sample to full Wikipedia PT (~1.5 B tokens / 5 GB)
- ⏳ Phase DELTA: comparison ablation against SOTA training recipes (we have to measure ourselves against the big players, not just a vanilla baseline) — see "Methodological comparison" below
Methodological comparison — measuring against SOTA recipes
A vanilla Transformer baseline is the easy yardstick. The honest yardstick is modern LLM training recipes as published by DeepSeek, Llama, Mistral, Qwen — the techniques used by the systems that actually set the state of the art.
For our nano-scale paired-t rig, the comparison kinds we are extending the ablation matrix with:
| Kind | What it adds vs vanilla baseline | Reference |
|---|---|---|
| `baseline` | vanilla causal LM, AdamW, cosine LR | (reference point) |
| `baseline_deepseek` | Multi-Token Prediction objective (predict next K tokens, not just 1), DeepSeek-style LR schedule, weight init | DeepSeek-V3 technical report, 2024 |
| `baseline_llama` | RoPE θ=500k, SwiGLU + slightly different init, attention dropout policy from Llama 3 | Llama 3 paper, 2024 |
| `masha_abc` (v0.2) | Morphemic + grammar fused into input embedding | This repo, docs/PASSO_B_MIXED_OBJECTIVE_DESIGN.md |
| `masha_v3` (v0.3) | Two-stream priors + gated logit fusion | This repo, docs/MASHA_V3_DESIGN.md |
The point: if masha_v3 beats baseline_deepseek on paired-t over the same data and compute, then MASHA has a publishable story. Beating only baseline is the bare minimum.
This comparison ablation is Phase DELTA. Its implementation will be tracked in scripts/run_ablation_word.py once Phase ALPHA (the current v0.3 paired-t) and Phase BETA (sense-aware) land.
What makes MASHA different
Standard LLMs treat language as a flat sequence of BPE tokens learned statistically from billions of examples. MASHA encodes explicit linguistic structure as an inductive bias:
| Level | What it captures | Where the knowledge comes from |
|---|---|---|
| Quantum | Characters, punctuation, accents | Defined a priori from grammatical theory |
| Morphemic | Etymological roots, prefixes, suffixes | Houaiss étim field (submodule git.pop.coop/pop/etimologia) |
| Atomic | Words composed of morphemes + grammatical class | Houaiss dictionary (submodule git.pop.coop/pop/dicionario) |
| Molecular | Phrases & clauses bound by grammatical relations | Cunha & Cintra (submodule git.pop.coop/pop/gramatica) |
The Morphemic level is the key compression: avião (airplane), aviador (aviator) and aviação (aviation) all share the root AVI (Latin avis, "bird"), so the model learns one embedding for the root and composes it with suffix embeddings. ~5 K roots × ~50 prefixes × ~150 suffixes cover ~80 % of PT-BR vocabulary.
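The composition E_root + Σ E_prefix + Σ E_suffix can be illustrated with a toy example. The vectors, the suffix splits and the names `E_ROOT`, `E_PREFIX`, `E_SUFFIX`, `DECOMP` and `compose` below are hypothetical, chosen only to show that the three words share one root vector; real decompositions come from the Houaiss-anchored tables.

```python
# Hypothetical toy tables; real ids and vectors come from the trained model.
E_ROOT = {"AVI": [1.0, 0.0, 2.0]}
E_PREFIX = {}
E_SUFFIX = {
    "-ão": [0.0, 1.0, 0.0],
    "-ador": [0.0, 0.0, 1.0],
    "-ação": [1.0, 1.0, 0.0],
}
# word -> (root, prefixes, suffixes); the splits are illustrative assumptions.
DECOMP = {
    "avião": ("AVI", [], ["-ão"]),
    "aviador": ("AVI", [], ["-ador"]),
    "aviação": ("AVI", [], ["-ação"]),
}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def compose(word):
    """Morphemic embedding: E_root + sum(E_prefix) + sum(E_suffix)."""
    root, prefixes, suffixes = DECOMP[word]
    vec = list(E_ROOT[root])
    for p in prefixes:
        vec = add(vec, E_PREFIX[p])
    for s in suffixes:
        vec = add(vec, E_SUFFIX[s])
    return vec


for w in DECOMP:
    print(w, compose(w))
```

With the sizes quoted above, the morphemic tables hold roughly 5 000 + 50 + 150 = 5 200 vectors, and every word that decomposes into known morphemes reuses them.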
Architecture at a glance
flowchart LR
subgraph CORPUS["Wikipedia PT (parquet, plain text only)"]
TXT["id · text"]
end
subgraph PG["Postgres — symbolic substrate"]
MORPH["morphemes<br/>~42 K ROOT/PREFIX/SUFFIX"]
WM["word_morphemes<br/>~60 K decompositions<br/>(GIN-indexed arrays)"]
LEX["lexicon_entry<br/>Houaiss headwords"]
end
subgraph CACHE["WordDecompCache (in-memory, loaded once)"]
ROOT["root_id_of[word]"]
PFX["prefix_ids_of[word]"]
SFX["suffix_ids_of[word]"]
GC["grammar_class_id(word)"]
end
subgraph DEPS["Stanza deps (parquet)"]
DP["per-article<br/>dependency arcs"]
end
subgraph MODEL["MASHA model (decoder-only Transformer + GNN)"]
direction TB
E1["E_token (atomic)"]
E2["E_root + Σ E_prefix + Σ E_suffix (morphemic)"]
E3["E_gclass (POS-tag)"]
SUM(("⊕"))
ATTN["Self-attention<br/>+ dependency bias matrix B"]
GNN["GATv2 over dep graph<br/>(grammar layer)"]
OUT["Linear → vocab"]
end
TXT -- "lookup at train time" --> ROOT
TXT -- "lookup at train time" --> PFX
TXT -- "lookup at train time" --> SFX
TXT -- "lookup at train time" --> GC
MORPH --> ROOT
WM --> ROOT
WM --> PFX
WM --> SFX
LEX --> ROOT
ROOT --> E2
PFX --> E2
SFX --> E2
GC --> E3
TXT --> E1
E1 --> SUM
E2 --> SUM
E3 --> SUM
SUM --> ATTN
DP --> ATTN
DP --> GNN
ATTN --> GNN
GNN --> OUT
The dashed-looking flow on the left is the Postgres trick — the corpus parquet has no annotations baked in; the cache pulls them from Postgres at the start of training. Ablations toggle which annotations are read, not which corpus is loaded. The right side is the model itself: token / morphemic / POS embeddings are summed at the input; self-attention runs with an optional dependency-bias matrix; a GATv2 GNN passes messages over the dependency graph in parallel.
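One way to read "self-attention with a dependency-bias matrix" is an additive bias on the pre-softmax scores. The sketch below assumes exactly that, with B[i][j] set to a fixed strength when token j is the syntactic head of token i. The function names, the `strength` parameter and the additive form are assumptions for illustration; causal masking is left out to keep the example short, and the repo's actual bias may be defined differently.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dependency_bias(n, arcs, strength=1.0):
    """B[i][j] = strength if token i has head j, else 0 (hypothetical form)."""
    bias = [[0.0] * n for _ in range(n)]
    for dependent, head in arcs:
        bias[dependent][head] = strength
    return bias


def attention_weights(scores, bias=None):
    """Row-wise softmax over scores, optionally adding a bias matrix first."""
    out = []
    for i, row in enumerate(scores):
        if bias is not None:
            row = [s + bias[i][j] for j, s in enumerate(row)]
        out.append(softmax(row))
    return out


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy scores, 3 tokens
arcs = [(1, 0), (2, 0)]                       # tokens 1 and 2 depend on token 0
B = dependency_bias(3, arcs, strength=2.0)
plain = attention_weights(scores)
biased = attention_weights(scores, B)
print(plain[1], biased[1])
```

With the bias applied, tokens 1 and 2 put more attention mass on their head (token 0), while token 0, which has no arc here, keeps its original distribution.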
Position in the literature
How MASHA relates to the closest published work on syntax- and morphology-aware language models:
| Aspect | TGs (DeepMind, 2022) | DTGs (ShanghaiTech, 2024) | Oseki et al. (Tokyo) | LISA (Strubell, 2018) | MASHA |
|---|---|---|---|---|---|
| Structure type | Constituency | Dependency | Both | Dependency (SRL) | Dependency |
| Integration | Attention mask | arc-eager / arc-standard transitions | Implicit supervision | Supervised attention head | GNN + attention bias + morphemic layer |
| Explicit grammar weight | No | No | No | No | Yes (POS-tag embedding) |
| Character-level ("quantum") layer | No | No | No | No | Yes |
| GNN encoder | No | No | No | No | Yes (GATv2, Brody et al. 2022) |
| Neuro-symbolic | Partial | Partial | No | Partial | Yes (Houaiss as a priori symbol) |
| Currently validated scale | Medium | Small | Small | Small | Nano (125-250 M, on a laptop A5000 16 GB) |
| Roadmap target | — | — | — | — | Micro 350 M → Small 1.3 B (hardware-blocked) |
The combination no published work has assembled, as far as we have found, is: GNN over the dependency graph + a priori grammatical weights (not learned from scratch) + structural attention bias + etymological morphemic layer. At the current nano scale the effect is in the expected direction but lacks statistical power (~1 % PPL reduction within seed variance). The table describes what is architectural and reproducible today; the "does it actually beat all of these" question depends on the micro / small runs that need hardware MASHA does not yet have.
How training data flows — the Postgres trick
A standard ML pipeline pre-vectorises the corpus once: tokenise, attach every annotation (POS tag, dependency edges, morpheme IDs, …) and dump the whole thing as fixed tensors (Parquet / Arrow / HF Datasets). Every change to an annotation forces a full corpus rebuild — and at PT-BR Wikipedia scale that costs hours of Stanza dependency parsing and gigabytes of disk.
MASHA refuses that bargain.
- The corpus stays plain text. `data/processed/wikipedia_pt_sample.parquet` carries `(id, text)` only: no token IDs, no morpheme IDs, no POS tags. Re-decomposing the morphemes never touches it.
- Annotations live in Postgres, normalised and indexed. Three tables (`morphemes`, `word_morphemes`, `lexicon_entry`) hold the symbolic substrate: ~42 K morphemes (ROOT / PREFIX / SUFFIX), ~60 K Houaiss-anchored decompositions, and the full lexicon as fallback. GIN indexes on the `prefix_ids[]` / `suffix_ids[]` array columns make queries fast (`masha/db/models.py`).
- At training start, `WordDecompCache.load_from_pg()` builds a flat in-memory cache. Per-batch lookups are O(1) on hash maps (`root_id_of[word]`, `prefix_ids_of[word]`, `suffix_ids_of[word]`, `grammar_class_id(word)`). The cache is ~200 K entries after Houaiss-anchored enrichment; the heuristic enrichment script (`scripts/cache_enrichment.py`) adds another ~13 K surface forms on top of the Houaiss-extracted ~57 K, and a lexicon fallback covers the remaining headwords with self-roots.
- Ablations are flag-toggled, not corpus-rebuilt. `baseline` (no morphemic, no grammar bias), `masha_ab` (morphemic + POS), and `masha_abc` (full MASHA) all read the same parquet; the cache decides which annotations to attach. Comparing architectures across N seeds takes a single command-line argument, not a re-encoding job.
Practical consequence: when we changed the morpheme decomposition heuristic this week and added 12 827 Houaiss-anchored cache entries (commit 8b2aa2c), the entire 10 K-article training corpus was reused unchanged. The next ablation needed only minutes of cache rebuild instead of hours of corpus rebuild.
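The cache-and-flag pattern described in this section can be sketched without a database. The class `DecompCacheSketch`, its `from_rows` / `annotate` methods, the row layout and the ids are all hypothetical stand-ins; the real `WordDecompCache.load_from_pg()` reads from Postgres, and the `grammar_bias` flag is an assumption about what separates `masha_abc` from `masha_ab`.

```python
from dataclasses import dataclass, field

# Which annotations each ablation kind reads. "morphemic" and "pos" follow
# the text; "grammar_bias" for masha_abc is an assumption for illustration.
KIND_FLAGS = {
    "baseline": {"morphemic": False, "pos": False, "grammar_bias": False},
    "masha_ab": {"morphemic": True, "pos": True, "grammar_bias": False},
    "masha_abc": {"morphemic": True, "pos": True, "grammar_bias": True},
}


@dataclass
class DecompCacheSketch:
    root_id_of: dict = field(default_factory=dict)
    prefix_ids_of: dict = field(default_factory=dict)
    suffix_ids_of: dict = field(default_factory=dict)
    gclass_id_of: dict = field(default_factory=dict)

    @classmethod
    def from_rows(cls, rows):
        """Build the hash maps once from rows fetched at training start."""
        cache = cls()
        for word, root_id, prefix_ids, suffix_ids, gclass_id in rows:
            cache.root_id_of[word] = root_id
            cache.prefix_ids_of[word] = tuple(prefix_ids)
            cache.suffix_ids_of[word] = tuple(suffix_ids)
            cache.gclass_id_of[word] = gclass_id
        return cache

    def annotate(self, word, kind):
        """Attach only the annotations the given ablation kind is allowed to see."""
        flags = KIND_FLAGS[kind]
        out = {"word": word, "use_grammar_bias": flags["grammar_bias"]}
        if flags["morphemic"]:
            out["root_id"] = self.root_id_of.get(word)
            out["prefix_ids"] = self.prefix_ids_of.get(word, ())
            out["suffix_ids"] = self.suffix_ids_of.get(word, ())
        if flags["pos"]:
            out["gclass_id"] = self.gclass_id_of.get(word)
        return out


# Made-up rows: (word, root_id, prefix_ids, suffix_ids, grammar_class_id).
rows = [
    ("aviador", 7, [], [31], 2),
    ("reaver", 12, [4], [], 5),
]
cache = DecompCacheSketch.from_rows(rows)
for kind in KIND_FLAGS:
    print(kind, cache.annotate("aviador", kind))
```

The corpus text is never modified here: switching the ablation kind only changes which dictionary lookups are performed per word.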
Anchoring references
The full bibliography lives in docs/REFERENCES.md (11 sections, ~50 entries). Short list of the pillars that matter most to orient a reading:
- Strubell, Verga, Andor, Weiss, McCallum (2018) — Linguistically-Informed Self-Attention for SRL (EMNLP). Closest published parallel to MASHA's attention bias.
- Park et al. (2021) — Morphology Matters: A Multilingual LM Analysis (TACL). Empirical case for taking morphology seriously in morphologically rich languages.
- Brody, Alon, Yahav (2022) — How Attentive are Graph Attention Networks? (ICLR). GATv2, the exact variant implemented in `masha_grad/gnn.py`.
- Souza, Nogueira, Lotufo (2020) — BERTimbau (BRACIS). The PT-BR baseline MASHA benchmarks against.
- Carmo et al. (2020) — PTT5. The other canonical PT-BR baseline.
- Hartmann et al. (2017) — Portuguese Word Embeddings (STIL, NILC). PT-BR reference embeddings.
- Houaiss, Villar, Franco (2009) — Dicionário Houaiss da Língua Portuguesa. Source of the morphemic decomposition.
- Qi, Zhang, Zhang, Bolton, Manning (2020) — Stanza (ACL Demo). POS + dependency parsing on the PT-Bosque UD treebank.
- Hoffmann et al. (2022) — Chinchilla + Kaplan et al. (2020). Scaling laws — why the nano negative result is coherent with the morphemic hypothesis still being open at the right token budget.
- Cotterell, Mielke, Eisner, Roark (2018) — Are All Languages Equally Hard to LM? (NAACL). BPC cross-language, the secondary metric MASHA reports alongside PPL.
- Garcez et al. (2019) + Mao et al. (2019) — neuro-symbolic computing umbrella.
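BPC, the secondary metric mentioned in the Cotterell et al. entry above, normalises by characters rather than tokens, which is what makes models with different tokenizers comparable. A minimal sketch of the conversion, assuming the eval loop reports a mean negative log-likelihood in nats per token; the function names and all numbers are illustrative, not measured.

```python
import math


def bits_per_char(total_nll_nats, n_chars):
    """Total NLL of a text in nats, converted to bits and divided by characters."""
    return total_nll_nats / (math.log(2) * n_chars)


def bpc_from_token_loss(mean_nll_nats_per_token, n_tokens, n_chars):
    """BPC from a per-token mean loss; token count and char count refer to the same text."""
    return bits_per_char(mean_nll_nats_per_token * n_tokens, n_chars)


# Illustrative only: the same 1000-character text under two tokenizations.
n_chars = 1000
bpc_word = bpc_from_token_loss(4.0, n_tokens=170, n_chars=n_chars)  # word-level
bpc_bpe = bpc_from_token_loss(2.8, n_tokens=260, n_chars=n_chars)   # BPE
print(f"word-level BPC={bpc_word:.3f}  BPE BPC={bpc_bpe:.3f}")
```

Per-token PPL is `exp(mean_nll_nats_per_token)` and depends on how many tokens the tokenizer emits, so it cannot be compared across tokenizers directly; BPC can, provided both models are scored on the same characters.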
Whitepaper
A working draft is in paper/WHITEPAPER.md. It is honest about state: Hypothesis A (morphemic signal visible at nano scale) is rejected statistically at the current n; Hypothesis B (signal needs ≥ 2 B tokens to surface) is still open and hardware-blocked. The contribution is framed as architectural + engineering + a reproducible ablation harness, not as a SOTA claim.
Reproducing the nano results
The whole nano ablation runs end-to-end from a single command after the repo is cloned and Postgres is loaded:
PYTHONPATH=. bash scripts/reproduce_nano.sh
The script (scripts/reproduce_nano.sh) hard-fails fast on any environmental gap (Python version, Postgres reachability, missing data files), runs the safety_guard smoke test before any expensive computation, launches the three-kind ablation across three seeds with the laptop-tuned watchdog defaults (RAM 85/92 %, GPU 82/88 °C), sanity-checks the output parquets, and prints the summary report. Override via environment variables:
SEEDS="44 45 46" STEPS=2000 OUT_DIR=experiments/repro_my_run \
PYTHONPATH=. bash scripts/reproduce_nano.sh
SKIP_TRAIN=1 re-runs only the analysis on existing parquets. The script is idempotent and re-runnable.
Phase 0 — Progressive Ladder (Marcos's notebook)
Hardware: NVIDIA RTX A5000 Laptop 16 GB / 31 GB RAM / 185 GB free NVMe.
| Rung | Params | Time | Purpose |
|---|---|---|---|
| Nano | ~125 M | ~12 h | Pipeline sanity check |
| Micro | ~350 M | ~3-5 d | Official PoC vs baseline |
| Small | ~1.3 B | ~2 wk | Confirmation before Phase 1 |
Each rung is compared head-to-head with a parameter-matched Standard Transformer baseline (same training data, same compute, only the neuro-symbolic stack differs).
Phase 1 — 32B at the datacenter
If MASHA-Small confirms the architectural hypothesis, scale to 32 B on 8× AMD MI300X 192 GB at the PopSolutions datacenter.
Plan
See .config/masha.md for the full plan (v0.2).
Repository
- Source: git.pop.coop/pop/MASHA
- License: CHARRUA v1.2
- Author: Marcos Mendez (PopSolutions)
Submodules
The three linguistic sources live in companion repositories, included as git submodules under data/sources/:
| Path | Submodule | What it provides |
|---|---|---|
| `data/sources/gramatica/` | git.pop.coop/pop/gramatica | Cunha & Cintra grammar as plain markdown (~2.5 MB) |
| `data/sources/dicionario/` | git.pop.coop/pop/dicionario | Houaiss lexicon: schema + reload pipeline (no copyrighted text) |
| `data/sources/etimologia/` | git.pop.coop/pop/etimologia | 27 K etymologies + 42 K morphemes + 60 K word_morphemes (JSONL) |
Clone with submodules:
git clone --recurse-submodules https://git.pop.coop/pop/MASHA.git
# or, if you already cloned:
git submodule update --init --recursive