Nova Gramática do Português Contemporâneo (Cunha & Cintra, 7ª ed. 2017, Lexikon) — plain-text conversion for MASHA
Conversion via PyMuPDF with TEXT_INHIBIT_SPACES + TEXT_DEHYPHENATE flags; 795 pages, ~1.2 MB clean text.
Nova Gramática do Português Contemporâneo
Plain-text (Markdown) conversion of:
Celso Cunha & Lindley Cintra — Nova Gramática do Português Contemporâneo, 7ª edição, 2017 — Lexikon Editora Digital — ISBN 9788583000310.
Used by the MASHA project as the source of grammatical rules and weights.
Layout
text/
├── nova_gramatica_full.md — whole book in one file (~1.2 MB)
└── pages_NNNN-MMMM.md — 50-page chunks for easier diffing
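
The chunk naming can be illustrated with a short sketch. It assumes that NNNN and MMMM are 1-based, inclusive page numbers zero-padded to four digits; the helper name `chunk_names` is hypothetical and is not taken from the MASHA scripts.

```python
from __future__ import annotations


def chunk_names(total_pages: int, chunk_size: int = 50) -> list[str]:
    """Return file names for consecutive page chunks.

    Assumption: pages are numbered from 1 and both ends of each
    range are inclusive, so the last chunk may be shorter.
    """
    names = []
    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)
        names.append(f"pages_{first:04d}-{last:04d}.md")
    return names


if __name__ == "__main__":
    names = chunk_names(795)
    print(len(names), names[0], names[-1])
```

Under these assumptions a 795-page book yields 16 chunk files, the last one covering fewer than 50 pages.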
Provenance
- Source PDF: 47.8 MB, 795 pages
- Extracted with PyMuPDF (`fitz`) using `TEXT_DEHYPHENATE | TEXT_INHIBIT_SPACES` flags to defeat Lexikon's letter-spaced typesetting (e.g. "NOVA G R A M Á T IC A" → "NOVA GRAMÁTICA")
- Soft hyphens (U+00AD) stripped after dehyphenation
- See MASHA `scripts/extract_grammar.py` for the exact pipeline
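
The extraction steps listed above might look roughly like the sketch below. It is not the MASHA pipeline itself (`scripts/extract_grammar.py` remains the reference); the function names `strip_soft_hyphens` and `extract_pages` are hypothetical, and PyMuPDF is imported lazily so the soft-hyphen helper can be used on its own.

```python
from __future__ import annotations

SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderers


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens that survive dehyphenation."""
    return text.replace(SOFT_HYPHEN, "")


def extract_pages(pdf_path: str) -> list[str]:
    """Return cleaned text for each page of the PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    # Same two flags the Provenance list names.
    flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_INHIBIT_SPACES
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            raw = page.get_text("text", flags=flags)
            pages.append(strip_soft_hyphens(raw))
    return pages
```

Passing `flags=` replaces PyMuPDF's default text-extraction flags instead of adding to them, so any default behaviour the real pipeline relies on would need to be OR-ed in explicitly.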
License
Conversion script: CHARRUA v1.2. Source text remains property of the original publisher (Lexikon Editora Digital). This repository is for research use by MASHA contributors.