pop/MASHA

Python 96.4%
TeX 1.9%
Shell 1.6%

Marcos 1aa94f5537 docs(epsilon): n=6 BPC tokenization-fair — t=-163 vs vanilla BPE, -181 vs DeepSeek BPE, -169 vs Llama BPE (p<0.001)	2026-05-14 02:39:50 -03:00

Phase EPSILON closes the methodological caveat from Phase DELTA (all kinds shared word-level vocab) by running the three SOTA training recipes on their native BPE 32K tokenization while keeping MASHA v0.3 on word-level. Comparison is on BPC (tokenization-fair).

Headline (n=6, df=5, all paired):
   Δ masha_v3 - baseline_bpe: μ=-0.3528 σ=0.0053 t=-163.57
   Δ masha_v3 - deepseek_bpe: μ=-0.3701 σ=0.0050 t=-181.13
   Δ masha_v3 - llama_bpe:    μ=-0.3525 σ=0.0051 t=-169.50

All three at two-tailed p<0.001 (crit |t|>6.87). 6/6 seeds favourable. The 0.35 BPC gap means MASHA encodes the same held-out text using ~14% fewer bits per character than any of the three BPE baselines, with the same compute and the same data.

Cross-phase summary (six paired-t tests, all p<0.001):
   vanilla word (DELTA):    t=-10.46
   DeepSeek word (DELTA):   t=-39.40
   Llama word (DELTA):      t=-7.13
   vanilla BPE (EPSILON):   t=-163.57
   DeepSeek BPE (EPSILON):  t=-181.13
   Llama BPE (EPSILON):     t=-169.50

This is the strongest single result of the MASHA campaign so far.

What remains:
- Phase GAMMA - full Wikipedia PT (~1.5B tokens, scale test)
- Phase BETA - sense-aware Stream B (polysemy)
- Cloud robustness (issue #23)
.config	feat: bootstrap MASHA project skeleton with v0.2 plan	2026-05-10 20:12:04 -03:00
data	feat: second-pass etymology refinement (1.5% -> 22.7% coverage)	2026-05-11 00:07:06 -03:00
docs	docs(epsilon): n=6 BPC tokenization-fair — t=-163 vs vanilla BPE,	2026-05-14 02:39:50 -03:00
experiments	study v3: full neuro-symbolic stack (with grammar_bias) — Hypothesis A REJECTED	2026-05-11 11:52:22 -03:00
masha	feat(delta): Phase DELTA — MASHA v0.3 vs DeepSeek MTP & Llama recipes	2026-05-14 01:17:00 -03:00
masha_grad	feat(masha_grad): Passo B.1.3 — MixedObjectiveDataset for UL2-style training	2026-05-13 12:04:38 -03:00
migrations	feat(schema): Phase C — v1 compatibility VIEW over v2 schema	2026-05-12 20:42:07 -03:00
paper	docs(whitepaper): §7.4 + §8.1 update — post-rebuild n=6 paired-t	2026-05-13 02:24:39 -03:00
scripts	feat(epsilon): Phase EPSILON — BPE 32K tokenization-fair comparison	2026-05-14 02:01:56 -03:00
tests	feat: word-level tokenization + Tier-1 eval rig	2026-05-11 18:16:37 -03:00
tinygrad@6b0a9f5ee6	feat(masha-grad): bootstrap tinygrad submodule, masha_grad package, and CLAUDE.md	2026-05-10 23:52:52 -03:00
.gitignore	chore(docs): neutralize vendor-agent references, gate sprint plan as internal	2026-05-12 13:47:29 -03:00
.gitmodules	feat(masha-grad): bootstrap tinygrad submodule, masha_grad package, and CLAUDE.md	2026-05-10 23:52:52 -03:00
Agent_Smith.md	chore(docs): neutralize vendor-agent references, gate sprint plan as internal	2026-05-12 13:47:29 -03:00
alembic.ini	feat(db): replace sqlite3 with PostgreSQL via SQLAlchemy 2.0 + Alembic	2026-05-10 22:55:18 -03:00
pyproject.toml	feat(db): replace sqlite3 with PostgreSQL via SQLAlchemy 2.0 + Alembic	2026-05-10 22:55:18 -03:00
README.md	docs(readme): v0.3 project state + SOTA comparison roadmap	2026-05-13 20:13:16 -03:00

README.md

MASHA

Modelo de Arquitetura Semântica Hierárquica Avançada (Advanced Hierarchical Semantic Architecture Model): a neuro-symbolic Portuguese language model that treats language as a hierarchical chemical system: characters → morphemes → words → phrases.

Sovereign AI begins with understanding one's own language.

Project state — 2026-05-13 (v0.3 refactor in progress)

The architecture is in the third refactor. Each iteration was triggered by a concrete empirical finding from the prior version:

Version	Architecture	Outcome that motivated the next iteration
v0.1	Pure baseline (vanilla decoder Transformer, no priors)	Reference point. Established the paired-t methodology and the ablation rig.
v0.2	Morphemic + grammar priors summed into the input embedding before the Transformer	Local paired-t n=6 mixed-objective showed Δabc = −41 PPL (two-tailed p<0.05). A cloud expedition on an H200 NVL exposed numerical fragility — the sign of the effect inverted under cu130 vs cu128. See `docs/CLOUD_EXPEDITION_REPORT.md`.
v0.3 (current)	Two separated streams: distributional LLM + symbolic prior, fused at logit level by a learned gate	Design guarantee: if the gate saturates at α = 1, MASHA's output is exactly the baseline's, so the baseline solution is always reachable and MASHA need not be worse than it wherever the symbolic prior is uninformative. See `docs/MASHA_V3_DESIGN.md`.

Today's success criterion (the one we actually care about):

A MASHA model learning Wikipedia PT must outperform a parameter-matched LLM trained with state-of-the-art recipes (DeepSeek-style, Llama-style, GPT-style) on the same Wikipedia PT, under the same compute budget, on the same paired seeds.

That is what v0.3 is built to test. Apples-to-apples, same machine (A5000 16 GB), same data, same gradient steps, same eval rig. The paired-t framework of Passos A/B/C already implements this — v0.3 only changes the architecture under test, and the ablation rig is being extended to add SOTA training-recipe baselines (not just vanilla Transformer).
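For orientation, the paired-t statistic behind these comparisons can be sketched in a few lines. This is a minimal illustration using only the standard library, not the repo's eval rig: the function name `paired_t` and the per-seed numbers are hypothetical. Each seed contributes one difference (model minus baseline), and the statistic is the mean difference divided by its standard error.

```python
import math
from statistics import mean, stdev


def paired_t(metric_a, metric_b):
    """Paired t statistic for two models evaluated on the same seeds.

    Returns (mean difference, sample std of differences, t, degrees of freedom).
    """
    if len(metric_a) != len(metric_b) or len(metric_a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    mu = mean(diffs)
    sigma = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mu / (sigma / math.sqrt(n))
    return mu, sigma, t, n - 1


# Hypothetical per-seed metrics (lower is better), one entry per paired seed.
masha = [2.10, 2.12, 2.09, 2.11, 2.10, 2.13]
base = [2.46, 2.47, 2.44, 2.47, 2.45, 2.48]
mu, sigma, t, df = paired_t(masha, base)
print(f"mean diff={mu:.4f} std={sigma:.4f} t={t:.2f} df={df}")
```

A negative t means the first model has the lower metric on average; significance is then read off the t distribution with n − 1 degrees of freedom.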

v0.3 in one diagram

Stream A — Distributional LLM (bit-identical to baseline)
   word_ids → token_embed → 12-layer Transformer → lm_head → logits_A [V]

Stream B — Symbolic prior (NEW)
   word_ids → lookup(root, prefix, suffix, POS) → MLP → logits_B [V]
   (uses morphemic + grammar features, NOT the token id itself)

Gate (NEW)
   α = sigmoid(linear(h_A))  ∈ [0, 1] per position
   initialized so α ≈ 0.95 at step 0 (model behaves like baseline initially)

Fusion at logits
   logits = α · logits_A + (1 − α) · logits_B

The architectural property v0.2 lacked: when α → 1 everywhere, logits = logits_A = baseline(x). The gradient is free to drive α to 1 wherever Stream B is noise, so v0.3's loss is upper-bounded by the baseline's loss in the limit of sufficient training.
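The fusion rule can be checked numerically with a small NumPy sketch. It is not the code in masha/model/masha_v3.py: the function name `fuse_logits`, the shapes, and the choice of zero gate weights with bias logit(0.95) are assumptions, one simple way to obtain α ≈ 0.95 at step 0.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_logits(h_a, logits_a, logits_b, w_gate, b_gate):
    """Gated logit fusion: alpha * logits_A + (1 - alpha) * logits_B.

    h_a: [T, D] hidden states of Stream A; logits_*: [T, V];
    w_gate: [D] and b_gate: scalar define one gate value per position.
    """
    alpha = sigmoid(h_a @ w_gate + b_gate)[:, None]  # [T, 1], broadcast over V
    return alpha * logits_a + (1.0 - alpha) * logits_b, alpha


rng = np.random.default_rng(0)
T, D, V = 4, 8, 10  # toy sizes, chosen only for the example
h_a = rng.normal(size=(T, D))
logits_a = rng.normal(size=(T, V))
logits_b = rng.normal(size=(T, V))

# Assumed init: zero weights and bias = logit(0.95) give alpha = 0.95 everywhere.
w_gate = np.zeros(D)
b_gate = np.log(0.95 / 0.05)
fused, alpha = fuse_logits(h_a, logits_a, logits_b, w_gate, b_gate)

# A saturated gate (very large bias) reproduces Stream A's logits.
fused_sat, alpha_sat = fuse_logits(h_a, logits_a, logits_b, w_gate, 50.0)
print(alpha.ravel(), np.allclose(fused_sat, logits_a))
```

The saturated case is the baseline-fallback property in miniature: with α at 1, Stream B contributes nothing to the output.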

Where we are this week

✅ v0.3 design doc: docs/MASHA_V3_DESIGN.md
✅ v0.3 implementation: masha/model/masha_v3.py (with SymbolicStream, gate, fused forward)
✅ Training-loop integration: scripts/run_ablation_word.py ships a fourth kind masha_v3 alongside baseline / masha_ab / masha_abc
✅ Smoke test: 200 steps × seed 41, v0.3 produced lower PPL and higher top-1 than the baseline on the first run (consistent with the baseline-fallback property of the gate; sign already favourable on a single seed, which is not yet a statistical result)
🔄 n=3 paired-t at local scale (256 K tokens/cell)
⏳ n=6 mixed-objective paired-t (the regime where v0.2 locally crossed p<0.05)
⏳ Phase BETA: sense-aware Stream B (polysemy, literal vs figurative — see design doc)
⏳ Phase GAMMA: expand training corpus from 40 MB Wikipedia sample to full Wikipedia PT (~1.5 B tokens / 5 GB)
⏳ Phase DELTA: comparison ablation against SOTA training recipes (we have to measure ourselves against the big players, not just a vanilla baseline) — see "Methodological comparison" below

Methodological comparison — measuring against SOTA recipes

A vanilla Transformer baseline is the easy yardstick. The honest yardstick is modern LLM training recipes as published by DeepSeek, Llama, Mistral, Qwen — the techniques used by the systems that actually set the state of the art.

For our nano-scale paired-t rig, the comparison kinds we are extending the ablation matrix with:

Kind	What it adds vs vanilla baseline	Reference
`baseline`	vanilla causal LM, AdamW, cosine LR	(reference point)
`baseline_deepseek`	Multi-Token Prediction objective (predict next K tokens, not just 1), DeepSeek-style LR schedule, weight init	DeepSeek-V3 technical report, 2024
`baseline_llama`	RoPE θ=500k, SwiGLU + slightly different init, attention dropout policy from Llama 3	Llama 3 paper, 2024
`masha_abc` (v0.2)	Morphemic + grammar fused into input embedding	This repo, `docs/PASSO_B_MIXED_OBJECTIVE_DESIGN.md`
`masha_v3` (v0.3)	Two-stream priors + gated logit fusion	This repo, `docs/MASHA_V3_DESIGN.md`

The point: if masha_v3 beats baseline_deepseek on paired-t over the same data and compute, then MASHA has a publishable story. Beating only baseline is the bare minimum.

This comparison ablation is Phase DELTA. Implementation tracked in scripts/run_ablation_word.py once Phase ALPHA (current v0.3 paired-t) and Phase BETA (sense-aware) land.

What makes MASHA different

Standard LLMs treat language as a flat sequence of BPE tokens learned statistically from billions of examples. MASHA encodes explicit linguistic structure as an inductive bias:

Level	What it captures	Where the knowledge comes from
Quantum	Characters, punctuation, accents	Defined a priori from grammatical theory
Morphemic	Etymological roots, prefixes, suffixes	Houaiss `étim` field (submodule `git.pop.coop/pop/etimologia`)
Atomic	Words composed of morphemes + grammatical class	Houaiss dictionary (submodule `git.pop.coop/pop/dicionario`)
Molecular	Phrases & clauses bound by grammatical relations	Cunha & Cintra (submodule `git.pop.coop/pop/gramatica`)

The Morphemic level is the key compression: avião, aviador, aviação all share the root AVI (Latin avis, "bird"), so the model learns one embedding for the root and composes it with suffix embeddings. Roughly 5 K roots × 50 prefixes × 150 suffixes cover about 80 % of the PT-BR vocabulary.
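A minimal sketch of that composition, assuming the additive form E_root + Σ E_prefix + Σ E_suffix shown in the architecture diagram. The tiny inventories, the segmentations of the three words, and the `compose` helper are hypothetical illustrations, not Houaiss-derived data or the repo's API.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy embedding width

# Hypothetical inventories; the real ones come from the etymology submodule.
ROOTS = {"AVI": 0}
PREFIXES = {"re": 0}
SUFFIXES = {"ão": 0, "ador": 1, "ação": 2}

E_root = rng.normal(size=(len(ROOTS), DIM))
E_prefix = rng.normal(size=(len(PREFIXES), DIM))
E_suffix = rng.normal(size=(len(SUFFIXES), DIM))


def compose(root, prefixes=(), suffixes=()):
    """Word vector = root embedding + sum of prefix and suffix embeddings."""
    vec = E_root[ROOTS[root]].copy()
    for p in prefixes:
        vec += E_prefix[PREFIXES[p]]
    for s in suffixes:
        vec += E_suffix[SUFFIXES[s]]
    return vec


# Illustrative segmentations: all three words reuse the single AVI root row.
aviao = compose("AVI", suffixes=["ão"])
aviador = compose("AVI", suffixes=["ador"])
aviacao = compose("AVI", suffixes=["ação"])

# The words differ only by their suffix embeddings.
print(np.allclose(aviador - aviao, E_suffix[1] - E_suffix[0]))
```

Under this scheme the parameter count grows with the number of roots and affixes rather than with the number of surface forms.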

Architecture at a glance

flowchart LR
    subgraph CORPUS["Wikipedia PT (parquet, plain text only)"]
        TXT["id · text"]
    end

    subgraph PG["Postgres — symbolic substrate"]
        MORPH["morphemes<br/>~42 K ROOT/PREFIX/SUFFIX"]
        WM["word_morphemes<br/>~60 K decompositions<br/>(GIN-indexed arrays)"]
        LEX["lexicon_entry<br/>Houaiss headwords"]
    end

    subgraph CACHE["WordDecompCache (in-memory, loaded once)"]
        ROOT["root_id_of[word]"]
        PFX["prefix_ids_of[word]"]
        SFX["suffix_ids_of[word]"]
        GC["grammar_class_id(word)"]
    end

    subgraph DEPS["Stanza deps (parquet)"]
        DP["per-article<br/>dependency arcs"]
    end

    subgraph MODEL["MASHA model (decoder-only Transformer + GNN)"]
        direction TB
        E1["E_token (atomic)"]
        E2["E_root + Σ E_prefix + Σ E_suffix (morphemic)"]
        E3["E_gclass (POS-tag)"]
        SUM(("⊕"))
        ATTN["Self-attention<br/>+ dependency bias matrix B"]
        GNN["GATv2 over dep graph<br/>(grammar layer)"]
        OUT["Linear → vocab"]
    end

    TXT -- "lookup at train time" --> ROOT
    TXT -- "lookup at train time" --> PFX
    TXT -- "lookup at train time" --> SFX
    TXT -- "lookup at train time" --> GC
    MORPH --> ROOT
    WM --> ROOT
    WM --> PFX
    WM --> SFX
    LEX --> ROOT
    ROOT --> E2
    PFX --> E2
    SFX --> E2
    GC --> E3
    TXT --> E1
    E1 --> SUM
    E2 --> SUM
    E3 --> SUM
    SUM --> ATTN
    DP --> ATTN
    DP --> GNN
    ATTN --> GNN
    GNN --> OUT

The flow on the left is the Postgres trick: the corpus parquet has no annotations baked in, and the cache pulls them from Postgres at the start of training. Ablations toggle which annotations are read, not which corpus is loaded. The right side is the model itself: token / morphemic / POS embeddings are summed at the input; self-attention runs with an optional dependency-bias matrix B; a GATv2 GNN passes messages over the dependency graph in parallel.
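As a rough illustration of the dependency-bias idea, the sketch below adds a matrix B to the attention scores before the softmax. The symmetric constant bias per arc, the function name, and the single-head layout are assumptions made for the example; the actual construction of B in this repo may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_dep_bias(q, k, v, arcs, bias_value=1.0):
    """Single-head causal attention with an additive dependency bias.

    q, k, v: [T, d]; arcs: iterable of (head, dependent) token positions.
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    B = np.zeros((T, T))
    for head, dep in arcs:
        B[dep, head] = bias_value  # dependent looks at its head
        B[head, dep] = bias_value  # and vice versa (masked if in the future)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores + B)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

_, w_plain = attention_with_dep_bias(q, k, v, arcs=[])
_, w_bias = attention_with_dep_bias(q, k, v, arcs=[(0, 2)], bias_value=2.0)
print(w_plain[2, 0], w_bias[2, 0])  # attention from token 2 to its head, token 0
```

With an empty arc list the function reduces to ordinary causal attention, which mirrors how an ablation can switch the bias off without touching the rest of the model.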

Position in the literature

How MASHA relates to the closest published work on syntax- and morphology-aware language models:

Aspect	TGs (DeepMind, 2022)	DTGs (ShanghaiTech, 2024)	Oseki et al. (Tokyo)	LISA (Strubell, 2018)	MASHA
Structure type	Constituency	Dependency	Both	Dependency (SRL)	Dependency
Integration	Attention mask	arc-eager / arc-standard transitions	Implicit supervision	Supervised attention head	GNN + attention bias + morphemic layer
Explicit grammar weight	No	No	No	No	Yes (POS-tag embedding)
Character-level ("quantum") layer	No	No	No	No	Yes
GNN encoder	No	No	No	No	Yes (GATv2, Brody et al. 2022)
Neuro-symbolic	Partial	Partial	No	Partial	Yes (Houaiss as a priori symbol)
Currently validated scale	Medium	Small	Small	Small	Nano (125-250 M, on a laptop A5000 16 GB)
Roadmap target	—	—	—	—	Micro 350 M → Small 1.3 B (hardware-blocked)

The combination no published work has assembled, as far as we have found, is: GNN over the dependency graph + a priori grammatical weights (not learned from scratch) + structural attention bias + etymological morphemic layer. At the current nano scale the effect is in the expected direction but lacks statistical power (~1 % PPL reduction within seed variance). The table describes what is architectural and reproducible today; the "does it actually beat all of these" question depends on the micro / small runs that need hardware MASHA does not yet have.

How training data flows — the Postgres trick

A standard ML pipeline pre-vectorises the corpus once: tokenise, attach every annotation (POS tag, dependency edges, morpheme IDs, …) and dump the whole thing as fixed tensors (Parquet / Arrow / HF Datasets). Every change to an annotation forces a full corpus rebuild — and at PT-BR Wikipedia scale that costs hours of Stanza dependency parsing and gigabytes of disk.

MASHA refuses that bargain.

The corpus stays plain text. data/processed/wikipedia_pt_sample.parquet carries (id, text) only — no token IDs, no morpheme IDs, no POS tags. Re-decomposing the morphemes never touches it.
Annotations live in Postgres, normalised and indexed. Three tables (morphemes, word_morphemes, lexicon_entry) hold the symbolic substrate: ~42 K morphemes (ROOT / PREFIX / SUFFIX), ~60 K Houaiss-anchored decompositions, and the full lexicon as fallback. GIN indexes on the prefix_ids[] / suffix_ids[] array columns make queries fast (masha/db/models.py).
At training start, WordDecompCache.load_from_pg() builds a flat in-memory cache. Per-batch lookups are O(1) on hash maps (root_id_of[word], prefix_ids_of[word], suffix_ids_of[word], grammar_class_id(word)). The cache is ~200 K entries after Houaiss-anchored enrichment; the heuristic enrichment script (scripts/cache_enrichment.py) adds another ~13 K surface forms on top of the Houaiss-extracted ~57 K, and a lexicon fallback covers the remaining headwords with self-roots.
Ablations are flag-toggled, not corpus-rebuilt. baseline (no morphemic, no grammar bias), masha_ab (morphemic + POS), and masha_abc (full MASHA) all read the same parquet; the cache decides which annotations to attach. Comparing architectures across N seeds takes a single command-line argument, not a re-encoding job.
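The four points above can be sketched with a toy in-memory cache. This is not the repo's WordDecompCache: the class name `ToyDecompCache`, the row layout, the IDs, and the `KIND_FLAGS` mapping are hypothetical, and the rows are passed in directly instead of being read from Postgres.

```python
from dataclasses import dataclass, field

# Assumed mapping from ablation kind to the annotation groups it reads.
# masha_abc additionally uses the dependency bias, which is not a per-word
# lookup and is therefore left out of this sketch.
KIND_FLAGS = {
    "baseline": {"morphemes": False, "pos": False},
    "masha_ab": {"morphemes": True, "pos": True},
    "masha_abc": {"morphemes": True, "pos": True},
}


@dataclass
class ToyDecompCache:
    root_id_of: dict = field(default_factory=dict)
    prefix_ids_of: dict = field(default_factory=dict)
    suffix_ids_of: dict = field(default_factory=dict)
    grammar_class_of: dict = field(default_factory=dict)

    @classmethod
    def from_rows(cls, rows):
        """Build the hash maps once; rows stand in for a database result set."""
        cache = cls()
        for word, root_id, prefix_ids, suffix_ids, gclass_id in rows:
            cache.root_id_of[word] = root_id
            cache.prefix_ids_of[word] = list(prefix_ids)
            cache.suffix_ids_of[word] = list(suffix_ids)
            cache.grammar_class_of[word] = gclass_id
        return cache

    def annotate(self, words, kind):
        """Attach only the annotations the given kind asks for (O(1) lookups)."""
        flags = KIND_FLAGS[kind]
        out = []
        for w in words:
            item = {"word": w}
            if flags["morphemes"]:
                item["root_id"] = self.root_id_of.get(w)
                item["prefix_ids"] = self.prefix_ids_of.get(w, [])
                item["suffix_ids"] = self.suffix_ids_of.get(w, [])
            if flags["pos"]:
                item["gclass_id"] = self.grammar_class_of.get(w)
            out.append(item)
        return out


# Hypothetical rows: (word, root_id, prefix_ids, suffix_ids, grammar_class_id)
cache = ToyDecompCache.from_rows([("aviador", 7, [], [12], 1)])
print(cache.annotate(["aviador"], "baseline"))
print(cache.annotate(["aviador"], "masha_ab"))
```

The same word list goes in for every kind; only the flags change, which is why switching ablations never requires re-encoding the corpus.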

Practical consequence: when we changed the morpheme decomposition heuristic this week and added 12 827 Houaiss-anchored cache entries (commit 8b2aa2c), the entire 10 K-article training corpus was reused unchanged. The next ablation reran in minutes of cache rebuild instead of hours of corpus rebuild.

Anchoring references

The full bibliography lives in docs/REFERENCES.md (11 sections, ~50 entries). Short list of the pillars that matter most to orient a reading:

Strubell, Verga, Andor, Weiss, McCallum (2018) — Linguistically-Informed Self-Attention for SRL (EMNLP). Closest published parallel to MASHA's attention bias.
Park et al. (2021) — Morphology Matters: A Multilingual LM Analysis (TACL). Empirical case for taking morphology seriously in morphologically rich languages.
Brody, Alon, Yahav (2022) — How Attentive are Graph Attention Networks? (ICLR) — GATv2, the exact variant implemented in masha_grad/gnn.py.
Souza, Nogueira, Lotufo (2020) — BERTimbau (BRACIS). The PT-BR baseline MASHA benchmarks against.
Carmo et al. (2020) — PTT5. The other canonical PT-BR baseline.
Hartmann et al. (2017) — Portuguese Word Embeddings (STIL, NILC). PT-BR reference embeddings.
Houaiss, Villar, Franco (2009) — Dicionário Houaiss da Língua Portuguesa. Source of the morphemic decomposition.
Qi, Zhang, Zhang, Bolton, Manning (2020) — Stanza (ACL Demo). POS + dependency parsing on the PT-Bosque UD treebank.
Hoffmann et al. (2022) — Chinchilla + Kaplan et al. (2020). Scaling laws — why the nano negative result is coherent with the morphemic hypothesis still being open at the right token budget.
Cotterell, Mielke, Eisner, Roark (2018) — Are All Languages Equally Hard to LM? (NAACL). BPC cross-language, the secondary metric MASHA reports alongside PPL.
Garcez et al. (2019) + Mao et al. (2019) — neuro-symbolic computing umbrella.
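Because BPC is the metric that makes word-level and BPE models comparable, a short sketch of the conversion may help. It assumes the evaluation loss is a mean negative log-likelihood in nats per token; the function names and the numbers are hypothetical.

```python
import math


def bits_per_character(total_nll_nats, num_chars):
    """Total negative log-likelihood (nats) converted to bits, per character."""
    return total_nll_nats / (math.log(2) * num_chars)


def bpc_from_token_loss(mean_nll_per_token, num_tokens, num_chars):
    """BPC from a mean per-token loss, independent of how the text was tokenized."""
    return bits_per_character(mean_nll_per_token * num_tokens, num_chars)


# Hypothetical: the same 1000-character text under two tokenizations.
word_level = bpc_from_token_loss(mean_nll_per_token=6.0, num_tokens=170, num_chars=1000)
bpe_level = bpc_from_token_loss(mean_nll_per_token=4.2, num_tokens=260, num_chars=1000)
print(f"word-level BPC={word_level:.3f}  BPE BPC={bpe_level:.3f}")
```

Per-token perplexity is not comparable across vocabularies because the token counts differ; dividing the total bits by the character count of the same held-out text removes that dependence.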

Whitepaper

A working draft is in paper/WHITEPAPER.md. It is honest about state: Hypothesis A (morphemic signal visible at nano scale) is rejected statistically at the current n; Hypothesis B (signal needs ≥ 2 B tokens to surface) is still open and hardware-blocked. The contribution is framed as architectural + engineering + a reproducible ablation harness, not as a SOTA claim.

Reproducing the nano results

The whole nano ablation runs end-to-end from a single command after the repo is cloned and Postgres is loaded:

PYTHONPATH=. bash scripts/reproduce_nano.sh

The script (scripts/reproduce_nano.sh) fails fast on any environmental gap (Python version, Postgres reachability, missing data files) and runs the safety_guard smoke test before any expensive computation. It then launches the three-kind ablation across three seeds with the laptop-tuned watchdog defaults (RAM 85/92 %, GPU 82/88 °C), sanity-checks the output parquets, and prints the summary report. Override the defaults via environment variables:

SEEDS="44 45 46" STEPS=2000 OUT_DIR=experiments/repro_my_run \
    PYTHONPATH=. bash scripts/reproduce_nano.sh

SKIP_TRAIN=1 re-runs only the analysis on existing parquets. The script is idempotent and re-runnable.

Phase 0 — Progressive Ladder (Marcos's notebook)

Hardware: NVIDIA RTX A5000 Laptop 16 GB / 31 GB RAM / 185 GB free NVMe.

Rung	Params	Time	Purpose
Nano	~125 M	~12 h	Pipeline sanity check
Micro	~350 M	~3-5 d	Official PoC vs. baseline
Small	~1.3 B	~2 wk	Confirmation before Phase 1

Each rung is compared head-to-head with a parameter-matched Standard Transformer baseline (same training data, same compute, only the neuro-symbolic stack differs).

Phase 1 — 32B at the datacenter

If MASHA-Small confirms the architectural hypothesis, scale to 32 B on 8× AMD MI300X 192 GB at the PopSolutions datacenter.

Plan

See .config/masha.md for the full plan (v0.2).

Repository

Source: git.pop.coop/pop/MASHA
License: CHARRUA v1.2
Author: Marcos Mendez (PopSolutions)

Submodules

The three linguistic sources live in companion repositories, included as git submodules under data/sources/:

Path	Submodule	What it provides
`data/sources/gramatica/`	git.pop.coop/pop/gramatica	Cunha & Cintra grammar as plain markdown (~2.5 MB)
`data/sources/dicionario/`	git.pop.coop/pop/dicionario	Houaiss lexicon — schema + reload pipeline (no copyrighted text)
`data/sources/etimologia/`	git.pop.coop/pop/etimologia	27 K etymologies + 42 K morphemes + 60 K word_morphemes (JSONL)

Clone with submodules:

git clone --recurse-submodules https://git.pop.coop/pop/MASHA.git
# or, if you already cloned:
git submodule update --init --recursive
