Acronym Normalization
Overview
Acronym normalization converts written acronyms and initialisms into their spoken form for TTS. It handles letter-period sequences (I.B.M.), all-caps abbreviations (IBM, API), and hyphenated forms (GPU-accelerated). A pronunciation mappings system provides overrides for terms that should be spoken differently from letter-by-letter splitting.
Acronym expansion runs late in the pipeline, after pronunciation mappings, entity extraction, and language-specific normalization have already processed the text.
Expansion Rules
The expand_acronym() function applies rules in priority order:
- Pronunciation mappings -- applied first in the pipeline, before any other transformation (see below).
- Preserved as-is -- certain acronyms are kept unchanged (e.g., NASA).
- Always split -- specific tech acronyms are always split letter-by-letter.
- Always spell -- Malaysian university/org acronyms that must be spelled out.
- Pronounceable word -- 4+ letter acronyms with 30-60% vowel density are pronounced as words.
- C-V-C pattern -- consonant-vowel-consonant tail pattern (e.g., JSON → J son).
- Default -- split all letters.
Preserved Acronyms
These acronyms are kept as-is because they are pronounceable as complete words:
| Acronym | Output |
|---|---|
| NASA | NASA |
Always-Split Acronyms
These tech acronyms are always split letter-by-letter, regardless of pronounceability:
| Acronym | Output |
|---|---|
| API | A P I |
| GPU | G P U |
| CPU | C P U |
| AI | A I |
| ML | M L |
| DL | D L |
| NLP | N L P |
| LLM | L L M |
| RL | R L |
| PLUS | P L U S |
Always-Spell (Malaysian Universities/Orgs)
These Malaysian university and organization acronyms are always spelled letter-by-letter, even though some look pronounceable:
| Acronym | Output |
|---|---|
| UITM | U I T M |
| UKM | U K M |
| USM | U S M |
| UTM | U T M |
| UPNM | U P N M |
| IIUM | I I U M |
| UM | U M |
| UPM | U P M |
Pronounceable Word Detection
For 4+ letter acronyms not in any special list, the system checks vowel density. If 30-60% of the letters are vowels and the word has consonants, it's treated as a pronounceable word and converted to lowercase:
| Acronym | Vowel Ratio | Output |
|---|---|---|
| MARA | 50% | mara |
| JAKIM | 40% | jakim |
| FELDA | 40% | felda |
| BERNAMA | 43% | bernama |
| PETRONAS | 38% | petronas |
| FINAS | 40% | finas |
| RAPID | 40% | rapid |
| BERNAS | 33% | bernas |
| PTPTN | 0% | P T P T N |
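The vowel-density check can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names `vowel_ratio` and `looks_pronounceable` are hypothetical, the 30-60% bounds are assumed to be inclusive, and Y is assumed to count as a consonant.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def vowel_ratio(acronym: str) -> float:
    """Return the fraction of letters that are vowels (0.0 if there are none)."""
    letters = [ch for ch in acronym.upper() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in VOWELS for ch in letters) / len(letters)


def looks_pronounceable(acronym: str, low: float = 0.30, high: float = 0.60) -> bool:
    """4+ letters, at least one consonant, and a vowel ratio within [low, high]."""
    letters = [ch for ch in acronym.upper() if ch.isalpha()]
    has_consonant = any(ch not in VOWELS for ch in letters)
    return len(letters) >= 4 and has_consonant and low <= vowel_ratio(acronym) <= high
```

Under these assumptions, MARA (2 of 4 letters are vowels) and BERNAS (2 of 6) pass the check, while PTPTN (no vowels) does not.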
C-V-C Pattern (Generalized Rule)
For 3+ letter acronyms that don't match the pronounceable word check, the generalized rule checks the tail (all letters after the first). If the tail has a consonant at the start, a vowel somewhere in the middle, and a consonant at the end, the first letter is spoken individually and the tail is pronounced as a word.
| Acronym | Tail Analysis | Output |
|---|---|---|
| JSON | "son" -- consonant-vowel-consonant | J son |
| JPEG | "peg" -- consonant-vowel-consonant | J peg |
| PNG | "ng" -- too short | P N G |
| FBI | "bi" -- too short | F B I |
Acronyms that don't match any rule are split letter-by-letter by default.
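Taken together, the rules above form a priority cascade. The sketch below mirrors that ordering using the lists from the tables on this page. It is an illustration under stated assumptions, not the actual `expand_acronym()` source: the function and constant names are hypothetical, input is assumed to be already uppercase, and pronunciation mappings are left out because they run as a separate, earlier step.

```python
VOWELS = set("AEIOU")
PRESERVED = {"NASA"}
ALWAYS_SPLIT = {"API", "GPU", "CPU", "AI", "ML", "DL", "NLP", "LLM", "RL", "PLUS"}
ALWAYS_SPELL = {"UITM", "UKM", "USM", "UTM", "UPNM", "IIUM", "UM", "UPM"}


def split_letters(word: str) -> str:
    """Default behaviour: 'FBI' becomes 'F B I'."""
    return " ".join(word)


def is_pronounceable(word: str) -> bool:
    """4+ letters, at least one consonant, 30-60% vowels (bounds assumed inclusive)."""
    vowels = sum(ch in VOWELS for ch in word)
    return len(word) >= 4 and vowels < len(word) and 0.30 <= vowels / len(word) <= 0.60


def has_cvc_tail(word: str) -> bool:
    """Tail (everything after the first letter) is consonant ... vowel ... consonant."""
    tail = word[1:]
    return (
        len(tail) >= 3
        and tail[0] not in VOWELS
        and tail[-1] not in VOWELS
        and any(ch in VOWELS for ch in tail[1:-1])
    )


def expand_acronym_sketch(word: str) -> str:
    """Apply the rules in the priority order listed under Expansion Rules."""
    if word in PRESERVED:
        return word
    if word in ALWAYS_SPLIT or word in ALWAYS_SPELL:
        return split_letters(word)
    if is_pronounceable(word):
        return word.lower()
    if has_cvc_tail(word):
        return f"{word[0]} {word[1:].lower()}"
    return split_letters(word)
```

The ordering matters: PLUS has a C-V-C tail ("lus") and UITM has 50% vowels, so both would be pronounced as words if the always-split and always-spell lists were not checked first.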
Letter-Period Sequences
Letter-period sequences (with dots between letters) are expanded by removing the periods and inserting spaces:
from revo_norm import normalize_text
normalize_text("I.B.M. released a new chip", language="en")
# "I B M released a new chip"
normalize_text("U.S.A. today", language="en")
# "U S A today"
normalize_text("I.T. department", language="en")
# "I T department"
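One way to express this rule is a single regular expression. The sketch below is an assumption about how such a step could be written, not the library's code: `expand_letter_periods` is a hypothetical name, and the pattern only matches uppercase letters and requires at least two letter-period pairs.

```python
import re

# Two or more "uppercase letter + period" pairs in a row, e.g. "I.B.M." or "U.S.A."
LETTER_PERIOD = re.compile(r"\b(?:[A-Z]\.){2,}")


def expand_letter_periods(text: str) -> str:
    """Drop the periods from each matched sequence and space out the letters."""

    def spaced(match: "re.Match[str]") -> str:
        return " ".join(match.group(0).replace(".", ""))

    return LETTER_PERIOD.sub(spaced, text)
```

Because of the `{2,}` quantifier, a single initial such as the "A." in "A. Smith" is left untouched.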
Hyphenated Acronyms
Hyphens between letters or between an acronym and a word are replaced with spaces:
normalize_text("GPU-accelerated rendering", language="en")
# "G P U accelerated rendering"
normalize_text("AI-powered tools", language="en")
# "A I powered tools"
normalize_text("state-of-the-art", language="en")
# "state of the art"
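The hyphen step on its own can be approximated with a lookaround pattern. This is a hedged sketch with a hypothetical function name: it only replaces hyphens that sit directly between two word characters, and it does not split the acronym itself (that is handled by the expansion rules). Note that `\w` also matches digits, so under this assumption "2-3" would become "2 3".

```python
import re

# A hyphen with a word character immediately on both sides.
INNER_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def replace_inner_hyphens(text: str) -> str:
    """Replace intra-word hyphens with spaces; leave free-standing hyphens alone."""
    return INNER_HYPHEN.sub(" ", text)
```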
Pronunciation Mappings Override
Pronunciation mappings are applied before acronym expansion and have the highest priority in the pipeline. They override the default behavior for specific terms.
from revo_norm import normalize_text
# GUI -> gooey (not "G U I")
normalize_text("Build a GUI interface", language="en")
# "Build a gooey interface"
# ASCII -> as key (not "A S C I I")
normalize_text("ASCII art", language="en")
# "as key art"
# GIF -> gif (not "G I F")
normalize_text("GIF format", language="en")
# "gif format"
# IEEE -> I triple E (not "I E E E")
normalize_text("IEEE standard", language="en")
# "I triple E standard"
# WiFi -> why fi (not "W I F I")
normalize_text("Connect to WiFi", language="en")
# "Connect to why fi"
# iOS -> I O S (not "i O S")
normalize_text("iOS update", language="en")
# "I O S update"
Adding Custom Pronunciation Mappings
Mappings must represent how a term sounds when spoken, not what it stands for. The system rejects mappings that expand an abbreviation instead of pronouncing it.
from revo_norm.pronunciation_mappings import add_custom_mapping
from revo_norm import normalize_text
# Add a pronunciation mapping
add_custom_mapping("SQL", "sequel")
normalize_text("Query the SQL database", language="en")
# "Query the sequel database"
# This will raise ValueError -- it's an expansion, not a pronunciation
# add_custom_mapping("YOLO", "you only live once")
Custom mappings apply globally (both languages) and are matched as whole words, case-insensitively.
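To make the whole-word, case-insensitive matching and the expansion check concrete, here is a small self-contained sketch. Everything in it is hypothetical: the registry, the function names, and especially the validation heuristic (reject a spoken form whose word initials spell the term) are assumptions for illustration. The page does not document how `add_custom_mapping` validates its input.

```python
import re

_MAPPINGS: dict[str, str] = {}  # hypothetical in-memory registry, keyed by lowercase term


def looks_like_expansion(term: str, spoken: str) -> bool:
    """Assumed heuristic: the initials of a multi-word spoken form spell the term."""
    words = spoken.split()
    initials = "".join(w[0] for w in words).lower()
    spelled_out = all(len(w) == 1 for w in words)  # e.g. "I O S" is a spelling, not an expansion
    return len(words) > 1 and not spelled_out and initials == term.lower()


def add_mapping_sketch(term: str, spoken: str) -> None:
    """Register a mapping, refusing spoken forms that look like expansions."""
    if looks_like_expansion(term, spoken):
        raise ValueError(f"{spoken!r} expands {term!r} rather than pronouncing it")
    _MAPPINGS[term.lower()] = spoken


def apply_mappings_sketch(text: str) -> str:
    """Replace each registered term where it appears as a whole word, ignoring case."""
    for term, spoken in _MAPPINGS.items():
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)
        text = pattern.sub(lambda m, s=spoken: s, text)
    return text
```

With this sketch, registering "SQL" as "sequel" rewrites "SQL" and "sql" but leaves "SQLite" alone, and registering "YOLO" as "you only live once" raises `ValueError`.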
Built-in Pronunciation Mappings
| Term | Spoken Form |
|---|---|
| GUI | gooey |
| ASCII | as key |
| IEEE | I triple E |
| GIF | gif |
| WiFi | why fi |
| iOS | I O S |
| bias | bai yers |
| Hj | Haji |
| Hjh | Hajah |
| Dr | Doktor |
| Prof | Profesor |
| Dato' | Dato |
How to Disable
from revo_norm import normalize_text
# Disable acronym expansion entirely
result = normalize_text("The API is fast", language="en", disable=["acronyms"])
# "The API is fast" (unchanged)
# Disable with minimal profile
result = normalize_text("The API is fast", language="en", profile="minimal")
# "The API is fast" (no acronym expansion)
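Conceptually, a `disable` list can be thought of as skipping named steps in an ordered pipeline. The sketch below is a simplified model, not revo_norm's internals: the step functions are stand-ins, and every step name other than "acronyms" is hypothetical.

```python
from typing import Callable, Iterable


def map_pronunciations(text: str) -> str:
    # Stand-in for the pronunciation-mapping step; only handles "GUI".
    return text.replace("GUI", "gooey")


def expand_acronyms(text: str) -> str:
    # Stand-in for acronym expansion; only handles "API".
    return text.replace("API", "A P I")


# Ordered (name, step) pairs. Mappings come first, as described on this page.
STEPS: list[tuple[str, Callable[[str], str]]] = [
    ("pronunciation_mappings", map_pronunciations),
    ("acronyms", expand_acronyms),
]


def run_pipeline(text: str, disable: Iterable[str] = ()) -> str:
    """Run every step whose name is not listed in `disable`."""
    skipped = set(disable)
    for name, step in STEPS:
        if name not in skipped:
            text = step(text)
    return text
```

In this model, disabling "acronyms" leaves "API" unchanged while the mapping step still rewrites "GUI", because the two are independent steps.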
Edge Cases
- Length limit: Only acronyms of 2-10 uppercase letters are expanded. Longer sequences are left as-is.
- Lowercase/mixed case: Only fully uppercase words are treated as acronyms. "Api" and "api" are not expanded.
- Inside entity placeholders: Acronyms inside entity placeholders (<<<...>>>) are protected from expansion.
- Pronunciation mappings run first: Even with acronyms disabled, pronunciation mappings still apply. They are a separate pipeline step.
- After language normalizer: Acronym expansion runs after language-specific normalization, so numbers have already been converted.
- Letter-period minimum: At least 2 letter-period pairs are required: "A." is not expanded, but "A.B." is.
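Several of these edge cases (the 2-10 letter limit, the uppercase-only rule, and placeholder protection) can be illustrated with two regular expressions. This is a sketch under assumptions: the names are hypothetical, every matched acronym is simply split letter-by-letter, and the placeholder syntax is taken to be exactly `<<<...>>>` as written above.

```python
import re

PLACEHOLDER = re.compile(r"(<<<.*?>>>)")   # captured so that split() keeps placeholders
ACRONYM = re.compile(r"\b[A-Z]{2,10}\b")   # whole words of 2-10 uppercase letters


def expand_unprotected(text: str) -> str:
    """Split acronyms letter-by-letter everywhere except inside placeholders."""
    parts = PLACEHOLDER.split(text)  # even indexes: plain text, odd indexes: placeholders
    for i in range(0, len(parts), 2):
        parts[i] = ACRONYM.sub(lambda m: " ".join(m.group(0)), parts[i])
    return "".join(parts)
```

The word boundaries in `ACRONYM` mean that an 11-letter uppercase run does not match at all, rather than having its first 10 letters split.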