Broken UTF-8 encoding in fornecedor (mojibake Navega��o VJB) #9
Reference
soberania-brasileira/digital#9
Context
Some supplier names in the DB are mojibake (UTF-8 bytes read as Latin-1 and re-encoded):
- Navega��o VJB (should be Navegação VJB)
- BRASOFTWARE INFORMÁTICA LTDA stored as BRASOFTWARE INFORM�TICA LTDA
Probable cause
Some older scraper (probably the TCE-SP one, which reads CSV) decodes its input with the wrong encoding and writes the already-corrupted text to Postgres.
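To make the suspected cause concrete, here is a minimal Python sketch of the two ways a wrong codec can corrupt a name. It is illustrative only and not taken from any scraper in the repository; the variable names are hypothetical and the sample string is the supplier name quoted above. If the stored values really contain the replacement character (U+FFFD), that matches the second case, where Latin-1 bytes are decoded as UTF-8 with replacement, rather than the first.

```python
original = "Navegação VJB"

# Case A: UTF-8 bytes decoded as Latin-1.
# Every byte maps to some character, so nothing is lost and the
# corruption can be reversed later.
utf8_read_as_latin1 = original.encode("utf-8").decode("latin-1")

# Case B: Latin-1 bytes decoded as UTF-8 with errors="replace".
# The bytes for "ç" and "ã" are invalid UTF-8 here, so each one is
# replaced by U+FFFD and the original byte values are discarded.
latin1_read_as_utf8 = original.encode("latin-1").decode("utf-8", errors="replace")

print(utf8_read_as_latin1)  # NavegaÃ§Ã£o VJB
print(latin1_read_as_utf8)  # Navega��o VJB
```

Which case applies matters for the cleanup: Case A can be undone from the stored text alone, while Case B most likely requires re-reading the source files with the correct encoding.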
Tasks
- Audit the scrapers (grep -nE "encoding=|charset=|cp1252|latin" scripts/scraper_*.py)
- Check the psycopg2 client_encoding in all scripts (it should always be UTF8)
- Find the affected rows (fornecedor LIKE '%�%') and try bytes(s, 'latin-1').decode('utf-8') to reverse the corruption
- Count the affected values: SELECT COUNT(DISTINCT fornecedor) FROM contratos WHERE fornecedor LIKE '%�%'
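The repair step in the task list could be sketched as below. This is a hedged example, not the project's actual fix script: try_fix_mojibake and REPLACEMENT_CHAR are hypothetical names, and the function only wraps the bytes(s, 'latin-1').decode('utf-8') round trip with guards so that rows it cannot repair are left alone.

```python
from typing import Optional

REPLACEMENT_CHAR = "\ufffd"  # the "�" character matched by the LIKE filter


def try_fix_mojibake(value: str) -> Optional[str]:
    """Return the repaired text, or None if the value should be left as-is."""
    # U+FFFD means the original bytes were already discarded, so the
    # round trip below cannot recover them.
    if REPLACEMENT_CHAR in value:
        return None
    try:
        fixed = bytes(value, "latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8 (which is what correctly stored accented text
        # looks like). In both cases, do not touch the value.
        return None
    # Pure-ASCII values round-trip unchanged; report them as "nothing to fix".
    return fixed if fixed != value else None


print(try_fix_mojibake("NavegaÃ§Ã£o VJB"))  # Navegação VJB
print(try_fix_mojibake("Navega\ufffd\ufffdo VJB"))  # None
```

Returning None instead of raising lets a batch script update only the rows that were actually repaired and log the rest for manual review. Rows that contain U+FFFD would need to be re-imported from the original source with the right encoding. For the client_encoding task, psycopg2 connections expose set_client_encoding('UTF8') and an encoding attribute that can be checked in each script.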