Marcos 1aa94f5537 docs(epsilon): n=6 BPC tokenization-fair — t=-163 vs vanilla BPE,
-181 vs DeepSeek BPE, -169 vs Llama BPE (p<0.001)

Phase EPSILON closes the methodological caveat from Phase DELTA
(all kinds shared word-level vocab) by running the three SOTA
training recipes on their native BPE 32K tokenization while keeping
MASHA v0.3 on word-level. Comparison is on BPC (tokenization-fair).

Headline (n=6, df=5, all paired):
  Δ masha_v3 - baseline_bpe:        μ=-0.3528  σ=0.0053  t=-163.57
  Δ masha_v3 - deepseek_bpe:        μ=-0.3701  σ=0.0050  t=-181.13
  Δ masha_v3 - llama_bpe:           μ=-0.3525  σ=0.0051  t=-169.50

All three at two-tailed p<0.001 (crit |t|>6.87). 6/6 seeds
favourable. The 0.35 BPC gap means MASHA encodes the same held-out
text using ~14% fewer bits per character than any of the three BPE
baselines, with the same compute and the same data.
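
A minimal sketch of why BPC is the tokenization-fair metric here, assuming each model reports a mean per-token cross-entropy in nats. The function names and the numbers in the example are illustrative and are not taken from the repo. Both models are normalised by the character count of the same held-out text, so a word-level and a BPE model become directly comparable even though their token counts differ.

```python
import math

def bits_per_character(mean_nll_nats, n_tokens, n_chars):
    """Convert a mean per-token cross-entropy (nats) into bits per character."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)
    return total_bits / n_chars

def relative_saving(bpc_model, bpc_baseline):
    """Fraction of bits per character saved relative to the baseline."""
    return (bpc_baseline - bpc_model) / bpc_baseline

# Illustrative values only: same text (n_chars), different tokenizations.
n_chars = 1_000
word_level_bpc = bits_per_character(mean_nll_nats=4.0, n_tokens=180, n_chars=n_chars)
bpe_bpc = bits_per_character(mean_nll_nats=3.0, n_tokens=300, n_chars=n_chars)
saving = relative_saving(word_level_bpc, bpe_bpc)
```

Per-token perplexity would not allow this comparison, because the two models split the same text into different numbers of tokens.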

Cross-phase summary (six paired-t tests, all p<0.001):
  vanilla word    (DELTA)  : t=-10.46
  DeepSeek word   (DELTA)  : t=-39.40
  Llama word      (DELTA)  : t= -7.13
  vanilla BPE     (EPSILON): t=-163.57
  DeepSeek BPE    (EPSILON): t=-181.13
  Llama BPE       (EPSILON): t=-169.50

This is the strongest single result of the MASHA campaign so far.

What remains:
  - Phase GAMMA  - full Wikipedia PT (~1.5B tokens, scale test)
  - Phase BETA   - sense-aware Stream B (polysemy)
  - Cloud robustness (issue #23)
2026-05-14 02:39:50 -03:00
.config feat: bootstrap MASHA project skeleton with v0.2 plan 2026-05-10 20:12:04 -03:00
data feat: second-pass etymology refinement (1.5% -> 22.7% coverage) 2026-05-11 00:07:06 -03:00
docs docs(epsilon): n=6 BPC tokenization-fair — t=-163 vs vanilla BPE, 2026-05-14 02:39:50 -03:00
experiments study v3: full neuro-symbolic stack (with grammar_bias) — Hypothesis A REJECTED 2026-05-11 11:52:22 -03:00
masha feat(delta): Phase DELTA — MASHA v0.3 vs DeepSeek MTP & Llama recipes 2026-05-14 01:17:00 -03:00
masha_grad feat(masha_grad): Passo B.1.3 — MixedObjectiveDataset for UL2-style training 2026-05-13 12:04:38 -03:00
migrations feat(schema): Phase C — v1 compatibility VIEW over v2 schema 2026-05-12 20:42:07 -03:00
paper docs(whitepaper): §7.4 + §8.1 update — post-rebuild n=6 paired-t 2026-05-13 02:24:39 -03:00
scripts feat(epsilon): Phase EPSILON — BPE 32K tokenization-fair comparison 2026-05-14 02:01:56 -03:00
tests feat: word-level tokenization + Tier-1 eval rig 2026-05-11 18:16:37 -03:00
tinygrad@6b0a9f5ee6 feat(masha-grad): bootstrap tinygrad submodule, masha_grad package, and CLAUDE.md 2026-05-10 23:52:52 -03:00
.gitignore chore(docs): neutralize vendor-agent references, gate sprint plan as internal 2026-05-12 13:47:29 -03:00
.gitmodules feat(masha-grad): bootstrap tinygrad submodule, masha_grad package, and CLAUDE.md 2026-05-10 23:52:52 -03:00
Agent_Smith.md chore(docs): neutralize vendor-agent references, gate sprint plan as internal 2026-05-12 13:47:29 -03:00
alembic.ini feat(db): replace sqlite3 with PostgreSQL via SQLAlchemy 2.0 + Alembic 2026-05-10 22:55:18 -03:00
pyproject.toml feat(db): replace sqlite3 with PostgreSQL via SQLAlchemy 2.0 + Alembic 2026-05-10 22:55:18 -03:00
README.md docs(readme): v0.3 project state + SOTA comparison roadmap 2026-05-13 20:13:16 -03:00

MASHA

Modelo de Arquitetura Semântica Hierárquica Avançada — a neuro-symbolic Portuguese language model that treats language as a hierarchical chemical system: characters → morphemes → words → phrases.

Sovereign AI begins with understanding its own language.

Project state — 2026-05-13 (v0.3 refactor in progress)

The architecture is in the third refactor. Each iteration was triggered by a concrete empirical finding from the prior version:

| Version | Architecture | Outcome that motivated the next iteration |
|---|---|---|
| v0.1 | Pure baseline (vanilla decoder Transformer, no priors) | Reference point. Established the paired-t methodology and the ablation rig. |
| v0.2 | Morphemic + grammar priors summed into the input embedding before the Transformer | Local paired-t n=6 mixed-objective showed Δabc = 41 PPL (two-tailed p<0.05). A cloud expedition on an H200 NVL exposed numerical fragility: the sign of the effect inverted under cu130 vs cu128. See docs/CLOUD_EXPEDITION_REPORT.md. |
| v0.3 (current) | Two separated streams: distributional LLM + symbolic prior, fused at logit level by a learned gate | Mathematical lower bound: if the gate goes to 1, MASHA degenerates exactly into the baseline. Cannot be strictly worse than baseline. See docs/MASHA_V3_DESIGN.md. |

Today's success criterion (the one we actually care about):

A MASHA model learning Wikipedia PT must outperform a parameter-matched LLM trained with state-of-the-art recipes (DeepSeek-style, Llama-style, GPT-style) on the same Wikipedia PT, under the same compute budget, on the same paired seeds.

That is what v0.3 is built to test. Apples-to-apples, same machine (A5000 16 GB), same data, same gradient steps, same eval rig. The paired-t framework of Passos A/B/C already implements this — v0.3 only changes the architecture under test, and the ablation rig is being extended to add SOTA training-recipe baselines (not just vanilla Transformer).
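
As a rough illustration of the paired-t comparison, here is a minimal sketch using only the Python standard library. The function names and the per-seed numbers are hypothetical and are not the repo's rig. The statistic is the mean of the per-seed differences divided by its standard error.

```python
import math
from statistics import mean, stdev

def paired_t(model_scores, baseline_scores):
    """Paired t statistic over per-seed differences (model minus baseline)."""
    deltas = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(deltas)
    mu = mean(deltas)
    sd = stdev(deltas)                      # sample standard deviation (n - 1)
    return mu, sd, mu / (sd / math.sqrt(n)), n - 1

def t_from_summary(mu, sd, n):
    """Same statistic when only the mean and std of the deltas are known."""
    return mu / (sd / math.sqrt(n))

# Illustrative per-seed losses (lower is better), not results from this repo.
model = [0.5, 1.0, 2.0]
baseline = [1.0, 2.0, 3.0]
mu, sd, t, df = paired_t(model, baseline)
```

Plugging the summary values reported for Phase EPSILON (μ=-0.3528, σ=0.0053, n=6) into `t_from_summary` gives roughly -163, which agrees with the reported t=-163.57 up to rounding of μ and σ.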

v0.3 in one diagram

Stream A — Distributional LLM (bit-identical to baseline)
   word_ids → token_embed → 12-layer Transformer → lm_head → logits_A [V]

Stream B — Symbolic prior (NEW)
   word_ids → lookup(root, prefix, suffix, POS) → MLP → logits_B [V]
   (uses morphemic + grammar features, NOT the token id itself)

Gate (NEW)
   α = sigmoid(linear(h_A))  ∈ [0, 1] per position
   initialized so α ≈ 0.95 at step 0 (model behaves like baseline initially)

Fusion at logits
   logits = α · logits_A + (1 − α) · logits_B

The architectural property v0.2 lacked: when α → 1 everywhere, logits = logits_A = baseline(x). The gradient is free to drive α to 1 wherever Stream B is noise, so v0.3's loss is upper-bounded by the baseline's loss in the limit of sufficient training.
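
The fusion rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in masha/model/masha_v3.py: the gate is assumed to be a single linear unit on h_A, its weights are set to zero for the demo, and the bias is chosen as the inverse sigmoid of 0.95 so that α starts at 0.95.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_bias_for(alpha0):
    """Inverse sigmoid: the bias b for which sigmoid(b) equals alpha0."""
    return math.log(alpha0 / (1.0 - alpha0))

def fuse_position(h_a, logits_a, logits_b, w_gate, b_gate):
    """Fuse both streams at one position.

    h_a: hidden state of Stream A (length D)
    logits_a, logits_b: per-vocabulary logits (length V)
    w_gate, b_gate: parameters of the assumed linear gate (D -> 1)
    """
    alpha = sigmoid(sum(w * h for w, h in zip(w_gate, h_a)) + b_gate)
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(logits_a, logits_b)]
    return fused, alpha

h_a = [0.3, -1.2, 0.7]
logits_a = [2.0, 0.5, -1.0, 0.0]
logits_b = [0.0, 1.5, 0.5, -2.0]
w_gate = [0.0, 0.0, 0.0]        # zero weights: alpha is set by the bias alone
b_gate = gate_bias_for(0.95)    # ln(19), about 2.94

fused, alpha = fuse_position(h_a, logits_a, logits_b, w_gate, b_gate)
# A saturated gate reproduces Stream A, i.e. the baseline logits.
saturated, alpha_sat = fuse_position(h_a, logits_a, logits_b, w_gate, 50.0)
```

With the gate saturated, the fused logits coincide with logits_A, which is the degenerate case the lower-bound argument relies on.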

Where we are this week

  • v0.3 design doc: docs/MASHA_V3_DESIGN.md
  • v0.3 implementation: masha/model/masha_v3.py (with SymbolicStream, gate, fused forward)
  • Training-loop integration: scripts/run_ablation_word.py ships a fourth kind masha_v3 alongside baseline / masha_ab / masha_abc
  • Smoke test: 200 steps × seed 41, v0.3 produced lower PPL and higher top-1 than the baseline on the first run (lower-bound guarantee respected; sign already favourable at init)
  • 🔄 n=3 paired-t at local scale (256 K tokens/cell)
  • n=6 mixed-objective paired-t (the regime where v0.2 locally crossed p<0.05)
  • Phase BETA: sense-aware Stream B (polysemy, literal vs figurative — see design doc)
  • Phase GAMMA: expand training corpus from 40 MB Wikipedia sample to full Wikipedia PT (~1.5 B tokens / 5 GB)
  • Phase DELTA: comparison ablation against SOTA training recipes (we have to measure ourselves against the big players, not just a vanilla baseline) — see "Methodological comparison" below

Methodological comparison — measuring against SOTA recipes

A vanilla Transformer baseline is the easy yardstick. The honest yardstick is modern LLM training recipes as published by DeepSeek, Llama, Mistral, Qwen — the techniques used by the systems that actually set the state of the art.

For our nano-scale paired-t rig, the comparison kinds we are extending the ablation matrix with:

| Kind | What it adds vs vanilla baseline | Reference |
|---|---|---|
| baseline | vanilla causal LM, AdamW, cosine LR | (reference point) |
| baseline_deepseek | Multi-Token Prediction objective (predict next K tokens, not just 1), DeepSeek-style LR schedule, weight init | DeepSeek-V3 technical report, 2024 |
| baseline_llama | RoPE θ=500k, SwiGLU + slightly different init, attention dropout policy from Llama 3 | Llama 3 paper, 2024 |
| masha_abc (v0.2) | Morphemic + grammar fused into input embedding | This repo, docs/PASSO_B_MIXED_OBJECTIVE_DESIGN.md |
| masha_v3 (v0.3) | Two-stream priors + gated logit fusion | This repo, docs/MASHA_V3_DESIGN.md |

The point: if masha_v3 beats baseline_deepseek on paired-t over the same data and compute, then MASHA has a publishable story. Beating only baseline is the bare minimum.

This comparison ablation is Phase DELTA. Its implementation will be tracked in scripts/run_ablation_word.py once Phase ALPHA (the current v0.3 paired-t) and Phase BETA (sense-aware Stream B) land.

What makes MASHA different

Standard LLMs treat language as a flat sequence of BPE tokens learned statistically from billions of examples. MASHA encodes explicit linguistic structure as an inductive bias:

| Level | What it captures | Where the knowledge comes from |
|---|---|---|
| Quantum | Characters, punctuation, accents | Defined a priori from grammatical theory |
| Morphemic | Etymological roots, prefixes, suffixes | Houaiss étim field (submodule git.pop.coop/pop/etimologia) |
| Atomic | Words composed of morphemes + grammatical class | Houaiss dictionary (submodule git.pop.coop/pop/dicionario) |
| Molecular | Phrases & clauses bound by grammatical relations | Cunha & Cintra (submodule git.pop.coop/pop/gramatica) |

The Morphemic level is the key compression: avião (airplane), aviador (aviator) and aviação (aviation) all share the root AVI (Latin avis, "bird"), so the model learns one embedding for the root and composes it with suffix embeddings. ~5 K roots × ~50 prefixes × ~150 suffixes cover ~80 % of PT-BR vocabulary.
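
A toy sketch of that composition, assuming the additive form E_root + Σ E_prefix + Σ E_suffix. The embedding tables, the dimension, and the segmentation of the three example words (AVI + -ão, AVI + -ador, AVI + -ação) are illustrative assumptions, not the Houaiss-derived decompositions used by the repo.

```python
import random

random.seed(0)
DIM = 8

def new_vector():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Toy embedding tables (hypothetical entries; real tables would be learned).
E_root = {"AVI": new_vector()}
E_prefix = {}
E_suffix = {"ão": new_vector(), "ador": new_vector(), "ação": new_vector()}

def morphemic_embedding(root, prefixes=(), suffixes=()):
    """Sum the root vector with every prefix and suffix vector."""
    vec = list(E_root[root])
    for table, keys in ((E_prefix, prefixes), (E_suffix, suffixes)):
        for key in keys:
            vec = [v + e for v, e in zip(vec, table[key])]
    return vec

aviao = morphemic_embedding("AVI", suffixes=["ão"])
aviador = morphemic_embedding("AVI", suffixes=["ador"])
aviacao = morphemic_embedding("AVI", suffixes=["ação"])

# Using the approximate counts quoted above: 5 000 + 50 + 150 vectors
# are enough to compose every root/prefix/suffix combination.
n_vectors = 5_000 + 50 + 150
```

Because the root vector is shared, the three words differ only by their suffix vectors, so anything learned about AVI transfers to all of them.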

Architecture at a glance

flowchart LR
    subgraph CORPUS["Wikipedia PT (parquet, plain text only)"]
        TXT["id · text"]
    end

    subgraph PG["Postgres — symbolic substrate"]
        MORPH["morphemes<br/>~42 K ROOT/PREFIX/SUFFIX"]
        WM["word_morphemes<br/>~60 K decompositions<br/>(GIN-indexed arrays)"]
        LEX["lexicon_entry<br/>Houaiss headwords"]
    end

    subgraph CACHE["WordDecompCache (in-memory, loaded once)"]
        ROOT["root_id_of[word]"]
        PFX["prefix_ids_of[word]"]
        SFX["suffix_ids_of[word]"]
        GC["grammar_class_id(word)"]
    end

    subgraph DEPS["Stanza deps (parquet)"]
        DP["per-article<br/>dependency arcs"]
    end

    subgraph MODEL["MASHA model (decoder-only Transformer + GNN)"]
        direction TB
        E1["E_token (atomic)"]
        E2["E_root + Σ E_prefix + Σ E_suffix (morphemic)"]
        E3["E_gclass (POS-tag)"]
        SUM(("⊕"))
        ATTN["Self-attention<br/>+ dependency bias matrix B"]
        GNN["GATv2 over dep graph<br/>(grammar layer)"]
        OUT["Linear → vocab"]
    end

    TXT -- "lookup at train time" --> ROOT
    TXT -- "lookup at train time" --> PFX
    TXT -- "lookup at train time" --> SFX
    TXT -- "lookup at train time" --> GC
    MORPH --> ROOT
    WM --> ROOT
    WM --> PFX
    WM --> SFX
    LEX --> ROOT
    ROOT --> E2
    PFX --> E2
    SFX --> E2
    GC --> E3
    TXT --> E1
    E1 --> SUM
    E2 --> SUM
    E3 --> SUM
    SUM --> ATTN
    DP --> ATTN
    DP --> GNN
    ATTN --> GNN
    GNN --> OUT

The dashed-looking flow on the left is the Postgres trick — the corpus parquet has no annotations baked in; the cache pulls them from Postgres at the start of training. Ablations toggle which annotations are read, not which corpus is loaded. The right side is the model itself: token / morphemic / POS embeddings are summed at the input; self-attention runs with an optional dependency-bias matrix; a GATv2 GNN passes messages over the dependency graph in parallel.
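
To make the dependency-bias idea concrete, here is a minimal pure-Python sketch of causal attention weights with an additive bias matrix B, i.e. softmax(q·kᵀ/√d + B). How B is actually built in MASHA is not specified here; the symmetric "add a fixed weight for every pair linked by a dependency arc" rule below is an assumption made for illustration only.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dependency_bias(n_tokens, arcs, weight=1.0):
    """Assumed rule: B[i][j] = weight when tokens i and j share an arc."""
    bias = [[0.0] * n_tokens for _ in range(n_tokens)]
    for head, dep in arcs:
        bias[head][dep] = weight
        bias[dep][head] = weight
    return bias

def attention_weights(q, k, bias=None):
    """Causal attention weights, computed row by row."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        row = []
        for j, kj in enumerate(k):
            if j > i:                       # causal mask: no look-ahead
                row.append(float("-inf"))
                continue
            score = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if bias is not None:
                score += bias[i][j]
            row.append(score)
        weights.append(softmax(row))
    return weights

q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plain = attention_weights(q, k)
biased = attention_weights(q, k, dependency_bias(3, arcs=[(2, 0)], weight=2.0))
```

In this toy case the arc between tokens 2 and 0 raises the weight token 2 places on token 0, while an all-zero B leaves the weights identical to standard attention.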

Position in the literature

How MASHA relates to the closest published work on syntax- and morphology-aware language models:

| Aspect | TGs (DeepMind, 2022) | DTGs (ShanghaiTech, 2024) | Oseki et al. (Tokyo) | LISA (Strubell, 2018) | MASHA |
|---|---|---|---|---|---|
| Structure type | Constituency | Dependency | Both | Dependency (SRL) | Dependency |
| Integration | Attention mask | arc-eager / arc-standard transitions | Implicit supervision | Supervised attention head | GNN + attention bias + morphemic layer |
| Explicit grammar weight | No | No | No | No | Yes (POS-tag embedding) |
| Character-level ("quantum") layer | No | No | No | No | Yes |
| GNN encoder | No | No | No | No | Yes (GATv2, Brody et al. 2022) |
| Neuro-symbolic | Partial | Partial | No | Partial | Yes (Houaiss as a priori symbol) |
| Currently validated scale | Medium | Small | Small | Small | Nano (125-250 M, on a laptop A5000 16 GB) |
| Roadmap target | | | | | Micro 350 M → Small 1.3 B (hardware-blocked) |

The combination no published work has assembled, as far as we have found, is: GNN over the dependency graph + a priori grammatical weights (not learned from scratch) + structural attention bias + etymological morphemic layer. At the current nano scale the effect is in the expected direction but lacks statistical power (~1 % PPL reduction within seed variance). The table describes what is architectural and reproducible today; the "does it actually beat all of these" question depends on the micro / small runs that need hardware MASHA does not yet have.

How training data flows — the Postgres trick

A standard ML pipeline pre-vectorises the corpus once: tokenise, attach every annotation (POS tag, dependency edges, morpheme IDs, …) and dump the whole thing as fixed tensors (Parquet / Arrow / HF Datasets). Every change to an annotation forces a full corpus rebuild — and at PT-BR Wikipedia scale that costs hours of Stanza dependency parsing and gigabytes of disk.

MASHA refuses that bargain.

  • The corpus stays plain text. data/processed/wikipedia_pt_sample.parquet carries (id, text) only — no token IDs, no morpheme IDs, no POS tags. Re-decomposing the morphemes never touches it.
  • Annotations live in Postgres, normalised and indexed. Three tables (morphemes, word_morphemes, lexicon_entry) hold the symbolic substrate: ~42 K morphemes (ROOT / PREFIX / SUFFIX), ~60 K Houaiss-anchored decompositions, and the full lexicon as fallback. GIN indexes on the prefix_ids[] / suffix_ids[] array columns make queries fast (masha/db/models.py).
  • At training start, WordDecompCache.load_from_pg() builds a flat in-memory cache. Per-batch lookups are O(1) on hash maps (root_id_of[word], prefix_ids_of[word], suffix_ids_of[word], grammar_class_id(word)). The cache is ~200 K entries after Houaiss-anchored enrichment; the heuristic enrichment script (scripts/cache_enrichment.py) adds another ~13 K surface forms on top of the Houaiss-extracted ~57 K, and a lexicon fallback covers the remaining headwords with self-roots.
  • Ablations are flag-toggled, not corpus-rebuilt. baseline (no morphemic, no grammar bias), masha_ab (morphemic + POS), and masha_abc (full MASHA) all read the same parquet; the cache decides which annotations to attach. Comparing architectures across N seeds takes a single command-line argument, not a re-encoding job.
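
The flow in the bullets above can be sketched as a small in-memory cache whose lookups are plain dictionary reads and whose ablation kind decides which annotations are attached. `ToyDecompCache`, `KIND_FLAGS`, the id values, and the use of id 0 for unknown words are all hypothetical; the real WordDecompCache is loaded from Postgres and falls back to the lexicon.

```python
from dataclasses import dataclass, field

# Which annotations each ablation kind reads (assumed mapping based on the
# kinds named above; the dependency bias of masha_abc is handled elsewhere).
KIND_FLAGS = {
    "baseline": {"morphemes": False, "pos": False},
    "masha_ab": {"morphemes": True, "pos": True},
    "masha_abc": {"morphemes": True, "pos": True},
}

@dataclass
class ToyDecompCache:
    root_id_of: dict = field(default_factory=dict)
    prefix_ids_of: dict = field(default_factory=dict)
    suffix_ids_of: dict = field(default_factory=dict)
    grammar_class_of: dict = field(default_factory=dict)

    def annotate(self, words, kind):
        """Attach only the annotations the given ablation kind asks for."""
        flags = KIND_FLAGS[kind]
        batch = {"words": list(words)}
        if flags["morphemes"]:
            batch["root_ids"] = [self.root_id_of.get(w, 0) for w in words]
            batch["prefix_ids"] = [self.prefix_ids_of.get(w, []) for w in words]
            batch["suffix_ids"] = [self.suffix_ids_of.get(w, []) for w in words]
        if flags["pos"]:
            batch["pos_ids"] = [self.grammar_class_of.get(w, 0) for w in words]
        return batch

cache = ToyDecompCache(
    root_id_of={"avião": 7, "aviador": 7},
    suffix_ids_of={"avião": [1], "aviador": [3]},
    grammar_class_of={"avião": 2, "aviador": 2},
)
words = ["avião", "aviador", "xyz"]
plain = cache.annotate(words, "baseline")   # same text, no annotations
full = cache.annotate(words, "masha_ab")    # same text, annotations attached
```

The corpus (`words`) is identical in both calls; only the flags change, which is why switching ablation kinds does not require re-encoding anything.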

Practical consequence: when we changed the morpheme decomposition heuristic this week and added 12 827 Houaiss-anchored cache entries (commit 8b2aa2c), the entire 10 K-article training corpus was reused unchanged. The next ablation reran in minutes of cache rebuild instead of hours of corpus rebuild.

Anchoring references

The full bibliography lives in docs/REFERENCES.md (11 sections, ~50 entries). Short list of the pillars that matter most to orient a reading:

  • Strubell, Verga, Andor, Weiss, McCallum (2018). Linguistically-Informed Self-Attention for SRL (EMNLP). Closest published parallel to MASHA's attention bias.
  • Park et al. (2021). Morphology Matters: A Multilingual LM Analysis (TACL). Empirical case for taking morphology seriously in morphologically rich languages.
  • Brody, Alon, Yahav (2022). How Attentive are Graph Attention Networks? (ICLR). GATv2, the exact variant implemented in masha_grad/gnn.py.
  • Souza, Nogueira, Lotufo (2020). BERTimbau (BRACIS). The PT-BR baseline MASHA benchmarks against.
  • Carmo et al. (2020). PTT5. The other canonical PT-BR baseline.
  • Hartmann et al. (2017). Portuguese Word Embeddings (STIL, NILC). PT-BR reference embeddings.
  • Houaiss, Villar, Franco (2009). Dicionário Houaiss da Língua Portuguesa. Source of the morphemic decomposition.
  • Qi, Zhang, Zhang, Bolton, Manning (2020). Stanza (ACL Demo). POS + dependency parsing on the PT-Bosque UD treebank.
  • Hoffmann et al. (2022), Chinchilla, and Kaplan et al. (2020). Scaling laws: why the nano negative result is coherent with the morphemic hypothesis still being open at the right token budget.
  • Cotterell, Mielke, Eisner, Roark (2018). Are All Languages Equally Hard to LM? (NAACL). BPC cross-language, the secondary metric MASHA reports alongside PPL.
  • Garcez et al. (2019) and Mao et al. (2019). Neuro-symbolic computing umbrella.

Whitepaper

A working draft is in paper/WHITEPAPER.md. It is honest about state: Hypothesis A (morphemic signal visible at nano scale) is rejected statistically at the current n; Hypothesis B (signal needs ≥ 2 B tokens to surface) is still open and hardware-blocked. The contribution is framed as architectural + engineering + a reproducible ablation harness, not as a SOTA claim.

Reproducing the nano results

The whole nano ablation runs end-to-end from a single command after the repo is cloned and Postgres is loaded:

PYTHONPATH=. bash scripts/reproduce_nano.sh

The script (scripts/reproduce_nano.sh) fails fast on any environmental gap (Python version, Postgres reachability, missing data files), runs the safety_guard smoke test before any expensive computation, launches the three-kind ablation across three seeds with the laptop-tuned watchdog defaults (RAM 85/92 %, GPU 82/88 °C), sanity-checks the output parquets, and prints the summary report. Defaults can be overridden via environment variables:

SEEDS="44 45 46" STEPS=2000 OUT_DIR=experiments/repro_my_run \
    PYTHONPATH=. bash scripts/reproduce_nano.sh

SKIP_TRAIN=1 re-runs only the analysis on existing parquets. The script is idempotent and re-runnable.

Phase 0 — Progressive Ladder (Marcos's notebook)

Hardware: NVIDIA RTX A5000 Laptop 16 GB / 31 GB RAM / 185 GB free NVMe.

| Rung | Params | Time | Purpose |
|---|---|---|---|
| Nano | ~125 M | ~12 h | Pipeline sanity check |
| Micro | ~350 M | ~3-5 d | Official PoC vs baseline |
| Small | ~1.3 B | ~2 wk | Confirmation before Phase 1 |

Each rung is compared head-to-head with a parameter-matched Standard Transformer baseline (same training data, same compute, only the neuro-symbolic stack differs).

Phase 1 — 32B at the datacenter

If MASHA-Small confirms the architectural hypothesis, scale to 32 B on 8× AMD MI300X 192 GB at the PopSolutions datacenter.

Plan

See .config/masha.md for the full plan (v0.2).

Repository

Submodules

The three linguistic sources live in companion repositories, included as git submodules under data/sources/:

| Path | Submodule | What it provides |
|---|---|---|
| data/sources/gramatica/ | git.pop.coop/pop/gramatica | Cunha & Cintra grammar as plain markdown (~2.5 MB) |
| data/sources/dicionario/ | git.pop.coop/pop/dicionario | Houaiss lexicon: schema + reload pipeline (no copyrighted text) |
| data/sources/etimologia/ | git.pop.coop/pop/etimologia | 27 K etymologies + 42 K morphemes + 60 K word_morphemes (JSONL) |

Clone with submodules:

git clone --recurse-submodules https://git.pop.coop/pop/MASHA.git
# or, if you already cloned:
git submodule update --init --recursive