normalize_text¶
The primary entry point for text normalization.
normalize_text(text, language='en', profile=None, disable=None, verbose=False, **kwargs)
Normalize text for TTS in the given language.
Parameters¶
text : str
Input text to normalize.
language : str
"en" for English, "ms" for Malay.
profile : str or None
One of "minimal", "basic", "standard", "aggressive".
If None the standard profile (all features on) is used.
disable : list[str] or None
Feature names to turn off, e.g. ["acronyms", "measurements"].
verbose : bool
If True, return a dict with mappings and triggered rules instead of
just the normalized text string.
**kwargs
Legacy boolean flags — accepted for backward compatibility but emit
a DeprecationWarning. Supported names:
normalize_spacing, fix_dot_letters, sound_words_field,
apply_pronunciation_overrides_flag, expand_abbreviations_flag,
expand_acronyms_flag, normalize_elongated_flag,
normalize_fractions_flag, normalize_x_kali_flag,
normalize_temperature_flag, normalize_ic_flag,
normalize_measurements_flag, normalize_hari_bulan_flag,
normalize_hijri_flag, extract_entities_first, config.
Returns¶
str or dict
Normalized text, or a dict with text, original, mappings,
and rules keys when verbose is True.
Parameters¶
text¶
str — The input text to normalize. Leading and trailing whitespace is stripped before processing. If the string is empty after stripping, an empty string is returned immediately.
language¶
str, default "en" — Target language for normalization.
| Value | Description |
|---|---|
| "en" | English normalization (contractions, numbers via inflect, English date/time formats) |
| "ms" | Malay normalization (Malay grammar, numbers via num2word_ms, Malay-specific features) |
profile¶
str | None, default None — A preset configuration profile. When provided, determines which feature groups are enabled.
| Profile | Description |
|---|---|
| "minimal" | Spacing normalization only |
| "basic" | Spacing + acronyms + abbreviations + elongated + Malay-local + special chars |
| "standard" | All features enabled (same as default when profile=None) |
| "aggressive" | All features enabled + strips [...] content |
When None, the standard profile (all features on) is used.
disable¶
list[str] | None, default None — A list of feature names to turn off. Feature names correspond to fields on Config. Unknown names are ignored.
Common feature names:
- "acronyms": disables acronym expansion (I.B.M., API, etc.)
- "measurements": disables measurement normalization (5km, 10kg, etc.)
- "temperature": disables temperature normalization (25C, -5F, etc.)
- "fractions": disables fraction normalization (3/4, 10/4, etc.)
- "dates": disables date-to-spoken conversion
- "times": disables time-to-spoken conversion
- "spacing": disables whitespace normalization
- "abbreviations": disables abbreviation expansion (currently a no-op)
- "elongated": disables elongated word normalization
- "special_chars": disables special character replacement (&, +, %, etc.)
- "pronunciation_overrides": disables pronunciation overrides
**kwargs (legacy flags)¶
Legacy boolean flags accepted for backward compatibility. Using any of these emits a DeprecationWarning.
Supported legacy names:
normalize_spacing, fix_dot_letters, sound_words_field, apply_pronunciation_overrides_flag, expand_abbreviations_flag, expand_acronyms_flag, normalize_elongated_flag, normalize_fractions_flag, normalize_x_kali_flag, normalize_temperature_flag, normalize_ic_flag, normalize_measurements_flag, normalize_hari_bulan_flag, normalize_hijri_flag, extract_entities_first, config
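As a rough illustration of how the profile and disable parameters described above could combine, the sketch below resolves them into a set of enabled feature names. It is not the library's implementation: `resolve_features`, `ALL_FEATURES`, and `PROFILES` are hypothetical names, the feature set is limited to the names listed under disable, and the "Malay-local" group and the "aggressive" stripping of [...] content are left out because no feature name is documented for them.

```python
from __future__ import annotations

# Feature names copied from the "Common feature names" list. The real Config
# may define more fields; this set is an assumption for illustration only.
ALL_FEATURES = frozenset({
    "acronyms", "measurements", "temperature", "fractions", "dates", "times",
    "spacing", "abbreviations", "elongated", "special_chars",
    "pronunciation_overrides",
})

# Hypothetical grouping derived from the profile table.
PROFILES = {
    "minimal": frozenset({"spacing"}),
    "basic": frozenset(
        {"spacing", "acronyms", "abbreviations", "elongated", "special_chars"}
    ),
    "standard": ALL_FEATURES,
    "aggressive": ALL_FEATURES,  # bracket stripping is not modelled here
}


def resolve_features(
    profile: str | None = None, disable: list[str] | None = None
) -> frozenset[str]:
    """Return the feature names left enabled after applying profile and disable."""
    enabled = PROFILES["standard" if profile is None else profile]
    # Set difference silently drops unknown names, mirroring the documented
    # "unknown names are ignored" behaviour.
    return enabled - frozenset(disable or [])


print(sorted(resolve_features("basic", disable=["acronyms", "no_such_feature"])))
```

Modelling disable as a set difference keeps the call order irrelevant: the profile picks a baseline and disable only ever removes from it.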
Return Value¶
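str by default: the normalized text, ready for TTS processing. When verbose is True, a dict with text, original, mappings, and rules keys is returned instead.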
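Because the docstring lists two return shapes (a plain string, or a dict when verbose is True), calling code that accepts either may want a small accessor. The sketch below is an assumption-laden example: `normalized_text_of` is a hypothetical helper, and the dict literal is a stand-in whose `mappings` and `rules` values are placeholders, since their exact types are not documented here.

```python
from typing import Union


def normalized_text_of(result: Union[str, dict]) -> str:
    """Return the normalized string from either documented return shape."""
    if isinstance(result, dict):
        # verbose=True shape: keys "text", "original", "mappings", "rules".
        return result["text"]
    return result


# Stand-in values; real ones would come from normalize_text().
plain_result = "The A P I is fast"
verbose_result = {
    "text": "The A P I is fast",
    "original": "The API is fast",
    "mappings": [],  # placeholder, actual structure not documented here
    "rules": [],     # placeholder, actual structure not documented here
}

assert normalized_text_of(plain_result) == normalized_text_of(verbose_result)
```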
Examples¶
Basic usage¶
from revo_norm import normalize_text
# English
result = normalize_text("The API is fast and costs $5.50", language="en")
# "The A P I is fast and costs five dollar fifty cents"
# Malay
result = normalize_text("RM30K untuk projek ML", language="ms")
# "tiga puluh ribu ringgit untuk projek M L"
With a profile¶
from revo_norm import normalize_text
# Minimal — only whitespace cleanup
result = normalize_text("The API is fast", language="en", profile="minimal")
# "The API is fast"
# Basic — adds acronym/abbreviation/special chars
result = normalize_text("5km & 10kg", language="en", profile="basic")
With disabled features¶
from revo_norm import normalize_text
# Keep acronyms as-is (no letter splitting)
result = normalize_text("Build the API with ML", language="en", disable=["acronyms"])
# "Build the API with ML"
# Disable multiple features
result = normalize_text(
    "25C and 3/4 of 5km",
    language="en",
    disable=["temperature", "fractions", "measurements"],
)
Legacy flags (deprecated)¶
import warnings
from revo_norm import normalize_text
# Legacy flags still work but emit DeprecationWarning
with warnings.catch_warnings():
    # Silence the warning only inside this block
    warnings.simplefilter("ignore", DeprecationWarning)
    result = normalize_text(
        "25C outside",
        language="en",
        normalize_temperature_flag=False,
    )
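When migrating away from the legacy flags, it can be more useful to surface the warnings than to silence them. The sketch below records them with the standard `warnings` module. `stub_normalize` is a hypothetical stand-in that warns the way this page describes, so the example runs without the library installed; with the real function the recording pattern would be the same.

```python
import warnings


def stub_normalize(text, **legacy_flags):
    """Hypothetical stand-in: warn once per legacy keyword, return text unchanged."""
    for name in legacy_flags:
        warnings.warn(f"{name} is deprecated", DeprecationWarning, stacklevel=2)
    return text


with warnings.catch_warnings(record=True) as caught:
    # "always" ensures the warning is recorded even if a filter would hide it.
    warnings.simplefilter("always")
    stub_normalize("25C outside", normalize_temperature_flag=False)

deprecated = [w for w in caught if issubclass(w.category, DeprecationWarning)]
for w in deprecated:
    print(w.message)
```

In a test suite, the same effect is available by running Python with `-W error::DeprecationWarning`, which turns each legacy-flag call into an exception.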
Pipeline Steps¶
When normalize_text() is called, the following steps execute in order:
- Currency suffix expansion: RM30K becomes RM30000, RM1M becomes RM1000000
- Entity extraction: entities are detected and replaced with <<<TYPE_ID>>> placeholders
- Pronunciation mappings: explicit mappings (e.g., GUI to "gooey") are applied first
- Placeholder stashing: entity placeholders are replaced with safe alphabetic tokens
- Feature-gated processing:
  - Pronunciation overrides
  - Elongated word normalization
  - Measurement normalization
  - X-kali normalization
  - Language-specific normalization (English or Malay)
  - Spacing normalization
  - Sound word removal
  - Abbreviation expansion
  - Acronym expansion
  - Comma insertion for repeated words
  - Special character replacement
- Entity restoration: placeholders are restored in spoken form
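The first and last stages above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's code: the function names are hypothetical, the currency regex covers only whole-number amounts with the K and M suffixes shown above, and `<<<TYPE_ID>>>` is read here as an entity type plus a running index.

```python
import re

# Multipliers for the two suffixes mentioned in the pipeline description.
_SUFFIX = {"K": 1_000, "M": 1_000_000}
# Whole-number amounts only; decimals such as RM1.5M are not handled here.
_CURRENCY = re.compile(r"\bRM(\d+)([KM])\b")


def expand_currency_suffix(text):
    """Rewrite RM30K as RM30000 and RM1M as RM1000000."""
    return _CURRENCY.sub(
        lambda m: f"RM{int(m.group(1)) * _SUFFIX[m.group(2)]}", text
    )


def extract_entities(text, pattern, kind):
    """Swap each match for a <<<KIND_n>>> placeholder and remember the original."""
    stash = {}

    def _swap(match):
        key = f"<<<{kind}_{len(stash)}>>>"
        stash[key] = match.group(0)
        return key

    return pattern.sub(_swap, text), stash


def restore_entities(text, stash, speak):
    """Put each stashed entity back after passing it through `speak`."""
    for key, original in stash.items():
        text = text.replace(key, speak(original))
    return text


expanded = expand_currency_suffix("RM30K untuk projek ML")
masked, stash = extract_entities(
    "Email a@b.co now", re.compile(r"\S+@\S+"), "EMAIL"
)
restored = restore_entities(masked, stash, str.upper)
print(expanded)
print(masked)
print(restored)
```

Extracting entities before the feature-gated steps means later rules (acronym splitting, special character replacement) cannot corrupt them; the `speak` callback stands in for whatever spoken-form conversion the real restoration step applies.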