Currency Normalization¶
Overview¶
Currency normalization converts written currency expressions into their spoken form for TTS. It supports multiple currencies, suffix-based magnitudes (K/M/B/T), decimal amounts, and produces bilingual output for English and Malay.
Currency extraction runs early in the pipeline to prevent other normalizers (acronym expansion, abbreviation expansion) from splitting symbols like "RM" into "R M" or "R meter".
Supported Currencies¶
| Symbol | Name | Sub-unit (EN) | Sub-unit (MS) |
|---|---|---|---|
RM |
Ringgit Malaysia | cent | sen |
MYR |
Ringgit Malaysia | cent | sen |
$ |
US Dollar | cent | sen |
USD |
US Dollar | cent | sen |
€ |
Euro | cent | sen |
EUR |
Euro | cent | sen |
£ |
British Pound | pence | pence |
GBP |
British Pound | pence | pence |
Suffix Expansion¶
Currency amounts can include magnitude suffixes that are expanded to their full numeric value before conversion to spoken form.
| Suffix | Meaning | Multiplier |
|---|---|---|
K |
Thousand | x1,000 |
M |
Million | x1,000,000 |
B |
Billion | x1,000,000,000 |
T |
Trillion | x1,000,000,000,000 |
Suffix expansion is the first step in the pipeline, running before entity extraction and URL processing.
Examples¶
Whole Amounts¶
Input: "The price is RM450000"
Output: "The price is empat ratus lima puluh ribu ringgit"
Input: "It costs $100"
Output: "It costs one hundred dollar"
Decimal Amounts¶
Input: "RM5.50"
Output: "lima ringgit lima puluh sen"
Input: "$0.99"
Output: "ninety nine cents"
Input: "$0.50"
Output: "fifty cents"
When the whole-number part is zero, the main unit is omitted. $0.50 becomes "fifty cents", not "zero dollar fifty cents".
Suffix Expansion¶
Input: "RM30K"
Output: "tiga puluh ribu ringgit"
Input: "RM1.5M"
Output: "satu juta lima ratus ribu ringgit"
Input: "$5B"
Output: "five billion dollar"
Input: "RM1T"
Output: "satu trilion ringgit"
Currency with Magnitude Words¶
The language normalizers also handle magnitude words after the amount:
Input: "RM2.5 million" (English pipeline)
Output: "two point five million ringgit"
Input: "RM2.5 juta" (Malay pipeline)
Output: "dua perpuluhan lima juta ringgit"
Currency with Commas¶
Input: "RM1,000,000"
Output: "satu juta ringgit"
Input: "$7,832"
Output: "seven thousand eight hundred thirty two dollar"
Multiple Currencies in One Sentence¶
How Entity Extraction Protects Currency¶
Currency is extracted as an entity before other normalizers run. The entity extractor replaces the currency expression with a placeholder (e.g., <<<CURRENCY_1>>>), processes the rest of the text, then restores the currency in spoken form.
This prevents the following cascading failures:
| Without Entity Extraction | With Entity Extraction |
|---|---|
| "RM 450" -> "R M 450" -> "R meter four hundred fifty" | "RM 450" -> "empat ratus lima puluh ringgit" |
| "$50K" -> "dollar five zero K" | "$50K" -> "fifty thousand dollar" |
Configuration¶
Currency normalization cannot be fully disabled -- it always runs as part of the entity extraction system to protect currency symbols from being mangled by other normalizers. However, you can use a minimal profile to reduce other processing:
from revo_norm import normalize_text
# Currency always gets extracted and protected
result = normalize_text("RM450", language="ms")
# Use minimal profile (only spacing normalization besides entity extraction)
result = normalize_text("RM450", language="ms", profile="minimal")
Edge Cases¶
- Sub-unit only:
RM0.50produces "lima puluh sen" (ringgit unit omitted). - Decimal padding:
RM5.5is treated the same asRM5.50-- the fraction is padded to two digits. - Trailing decimal:
RM5.is treated as a whole numberRM5. - Case insensitivity:
rm30kandRM30Kare both handled. - Space between symbol and amount: Both
RM100andRM 100are recognized. - Comma-separated amounts:
RM1,000,000has commas stripped before number-to-words conversion.