Language Support

Revo-norm supports two languages, English and Malay, and accounts for the code-mixing patterns common in Southeast Asian text.

Supported Languages

Code  Language               Notes
en    English                Full normalization with contractions, ordinals, abbreviations
ms    Malay (Bahasa Melayu)  Full normalization with Malay number words, currency, local features

from revo_norm import normalize_text

result_en = normalize_text("25C outside", language="en")
# "twenty five degrees celsius outside"

result_ms = normalize_text("25C di luar", language="ms")
# "dua puluh lima darjah selsius di luar"

Code-Mixing

Malaysian Malay text frequently mixes in English terms. The language parameter controls which normalizer runs on the non-entity text, while English technical terms (API, ML, GPU, etc.) are handled the same way regardless of language.

# English terms in a Malay sentence
result = normalize_text("Projek ML ni guna 5GB RAM", language="ms")
# "Projek M L ni guna lima gigabyte R A M"

# Malay terms in an English sentence
result = normalize_text("The ringgit fell 5% today", language="en")
# "The ringgit fell five percent today"

Pronunciation mappings apply to both languages equally -- terms like GUI, ASCII, and IEEE are mapped to their spoken forms regardless of the language parameter.


English Specifics (language="en")

Contractions

Common English contractions are expanded to their full forms for clean TTS output:

normalize_text("I'm happy they can't come", language="en")
# "I am happy they cannot come"

Supported contractions include:

  • Pronoun + verb: I'm -> I am, you're -> you are, it's -> it is
  • Negations: don't -> do not, can't -> cannot, won't -> will not
  • Perfect tense: should've -> should have, could've -> could have
  • Question words: what's -> what is, where's -> where is

Over 60 contractions are handled, including negative forms, possessive-like forms, and perfect tense contractions.
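A dictionary lookup is one simple way to picture this expansion. The sketch below is illustrative only: `CONTRACTIONS` and `expand_contractions` are hypothetical names, the table holds just the forms listed above, and nothing here is taken from revo-norm's actual implementation.

```python
import re

# Hypothetical, deliberately small table (not revo-norm's real list).
CONTRACTIONS = {
    "i'm": "I am",
    "you're": "you are",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "should've": "should have",
    "could've": "could have",
    "what's": "what is",
    "where's": "where is",
}

# Longest keys first so a longer entry is never shadowed by a shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Preserve a leading capital, e.g. "Don't" -> "Do not".
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)


print(expand_contractions("I'm happy they can't come"))
# I am happy they cannot come
```

A plain lookup cannot tell "it's" (it is) from "it's" (it has), so a production table needs either a fixed choice per form or some context handling.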

Ordinals

English ordinal numbers are converted to their spoken form:

normalize_text("She finished 1st in the 21st century", language="en")
# "She finished first in the twenty first century"

Abbreviations

Common title, place, and company abbreviations are expanded:

Abbreviation Spoken Form
Mr. Mister
Mrs. Misses
Dr. Doctor
St. Saint
Jr. Junior
Ltd. Limited
Capt. Captain
Gen. General
Sgt. Sergeant

Numbers and Currency

normalize_text("1,000,000 dollars at 3.5%", language="en")
# "one million dollars at three point five percent"

normalize_text("The price is $45.99", language="en")
# "The price is forty five dollar ninety nine cent"

Years

Four-digit numbers in the 1000-2099 range are read as years:

normalize_text("Born in 1990, graduated 2012", language="en")
# "Born in nineteen ninety, graduated twenty twelve"
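The year reading above amounts to splitting the number into two two-digit halves. The sketch below shows that idea under stated assumptions: `year_to_words` is a hypothetical helper, and its handling of round years ("two thousand", "nineteen hundred") and of years such as 1905 ("nineteen oh five") is a guess for illustration, not documented revo-norm behavior.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99 without hyphens, matching the style used in these docs."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_words(year: int) -> str:
    """Read a year in the 1000-2099 range as two two-digit halves."""
    if not 1000 <= year <= 2099:
        raise ValueError("only years in the 1000-2099 range are read as years")
    high, low = divmod(year, 100)
    if low == 0:
        # Assumed readings for round years: 2000 and 1000 use "thousand".
        if high % 10 == 0:
            return f"{ONES[high // 10]} thousand"
        return f"{two_digits(high)} hundred"
    if low < 10:
        # Assumed reading: 1905 -> "nineteen oh five".
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


print(year_to_words(1990))  # nineteen ninety
print(year_to_words(2012))  # twenty twelve
```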

Dates and Times

normalize_text("Meeting on 15/08/2025 at 3:30 pm", language="en")
# "Meeting on fifteenth of August, two thousand and twenty five at three thirty p m"

Malay Specifics (language="ms")

Number Words

Numbers are converted to Malay number words:

Number Malay
0 kosong
1 satu
10 sepuluh
11 sebelas
42 empat puluh dua
100 seratus
1000 seribu
1,000,000 satu juta

The special contracted forms are used automatically:

  • sebelas (eleven, not satu belas)
  • sepuluh (ten, not satu puluh; puluh is the form used in compounds such as dua puluh)
  • seratus (one hundred, not satu ratus)
  • seribu (one thousand, not satu ribu)

normalize_text("Ada 115 orang", language="ms")
# "Ada seratus lima belas orang"
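The contracted forms can be sketched as a small recursive-style converter. This is a minimal illustration, not revo-norm's code: `malay_number_words` is a hypothetical helper that covers 0 up to 999,999,999 and follows the table above, including satu juta for one million.

```python
DIGITS = ["kosong", "satu", "dua", "tiga", "empat", "lima", "enam", "tujuh",
          "lapan", "sembilan"]


def _below_thousand(n: int) -> str:
    """Spell out 1..999, applying the se- contractions."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        parts.append("seratus")              # not "satu ratus"
    elif hundreds > 1:
        parts.append(f"{DIGITS[hundreds]} ratus")
    if rest == 10:
        parts.append("sepuluh")              # not "satu puluh"
    elif rest == 11:
        parts.append("sebelas")              # not "satu belas"
    elif 12 <= rest <= 19:
        parts.append(f"{DIGITS[rest - 10]} belas")
    elif rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(f"{DIGITS[tens]} puluh")
        if ones:
            parts.append(DIGITS[ones])
    elif rest > 0:
        parts.append(DIGITS[rest])
    return " ".join(parts)


def malay_number_words(n: int) -> str:
    """Spell out a non-negative integer below one billion in Malay."""
    if not 0 <= n < 1_000_000_000:
        raise ValueError("this sketch only covers 0..999,999,999")
    if n == 0:
        return "kosong"
    parts = []
    millions, n = divmod(n, 1_000_000)
    thousands, rest = divmod(n, 1000)
    if millions:
        parts.append(f"{_below_thousand(millions)} juta")
    if thousands == 1:
        parts.append("seribu")               # not "satu ribu"
    elif thousands > 1:
        parts.append(f"{_below_thousand(thousands)} ribu")
    if rest:
        parts.append(_below_thousand(rest))
    return " ".join(parts)


print(malay_number_words(115))  # seratus lima belas
```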

Currency

Ringgit Malaysia (RM) is the primary currency, with ringgit as the main unit and sen as the subunit:

normalize_text("Harga RM 450.50", language="ms")
# "Harga empat ratus lima puluh ringgit lima puluh sen"

normalize_text("RM30K untuk projek ni", language="ms")
# "tiga puluh ribu ringgit untuk projek ni"

Supported currency symbols and codes: RM, $, £, EUR, USD, GBP, MYR. Amount suffixes K, M, B, T are expanded to full numbers.
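The suffix expansion can be pictured as a multiplication step that runs before the amount is spelled out. The sketch below is an assumption-laden illustration: `expand_amount_suffix` and `MULTIPLIERS` are hypothetical names, only the RM prefix and uppercase suffixes are handled, and reading B as billion and T as trillion is an assumption.

```python
import re
from decimal import Decimal

# Assumed meanings: thousand, million, billion, trillion.
MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}

# "RM30K", "RM 2.5M": prefix, optional space, number, one suffix letter.
_AMOUNT = re.compile(r"\bRM\s?(\d+(?:\.\d+)?)([KMBT])\b")


def expand_amount_suffix(text: str) -> str:
    """Rewrite suffixed ringgit amounts as plain numbers, e.g. RM30K -> RM 30000."""

    def _replace(match):
        # Decimal avoids float rounding on inputs such as 2.5M.
        value = Decimal(match.group(1)) * MULTIPLIERS[match.group(2)]
        # normalize() drops trailing zeros; the "f" format avoids exponents.
        return f"RM {value.normalize():f}"

    return _AMOUNT.sub(_replace, text)


print(expand_amount_suffix("RM30K untuk projek ni"))
# RM 30000 untuk projek ni
```

The expanded number would then go through the ordinary number-to-words step.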

Date Months

Malay month names are used when normalizing dates:

Month Malay
January Januari
February Februari
March Mac
April April
May Mei
June Jun
July Julai
August Ogos
September September
October Oktober
November November
December Disember

normalize_text("Tarikh: 15/08/2025", language="ms")
# "Tarikh: lima belas Ogos dua ribu dua puluh lima"

Time Meridians

Malay time expressions use pagi (morning/AM) and petang (afternoon/PM):

normalize_text("Jumpa pukul 8:30 pagi", language="ms")
# "Jumpa pukul lapan tiga puluh pagi"

Malaysian Identity Card Numbers (IC)

Malaysian IC numbers (12 digits in YYMMDD-SS-XXXX format) are normalized to spoken form:

normalize_text("No IC: 901231-10-5678", language="ms")
# The IC number is expanded digit by digit with structural grouping
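One plausible reading of "digit by digit with structural grouping" is sketched below. The exact spoken form revo-norm produces is not shown in this document, so the output format here (Malay digit words, groups separated by commas) is purely an assumption, and `speak_ic` is a hypothetical helper.

```python
import re

MALAY_DIGITS = ["kosong", "satu", "dua", "tiga", "empat", "lima", "enam",
                "tujuh", "lapan", "sembilan"]

# Three hyphen-separated groups: 6 digits, 2 digits, 4 digits.
_IC = re.compile(r"(\d{6})-(\d{2})-(\d{4})")


def speak_ic(ic: str) -> str:
    """Spell an IC number digit by digit, keeping its three groups apart."""
    match = _IC.fullmatch(ic.strip())
    if match is None:
        raise ValueError(f"not a 12-digit IC number: {ic!r}")
    groups = [
        " ".join(MALAY_DIGITS[int(digit)] for digit in group)
        for group in match.groups()
    ]
    # Commas mark the group boundaries so a TTS engine can pause there.
    return ", ".join(groups)


print(speak_ic("901231-10-5678"))
# sembilan kosong satu dua tiga satu, satu kosong, lima enam tujuh lapan
```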

Elongated Words

Repeated characters, which are common in informal Malay text, are reduced:

normalize_text("Sedihhh sangat laa", language="ms")
# "Sedih sangat la"
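A regex-based heuristic is enough to reproduce the example above. The two rules in this sketch are assumptions chosen for illustration, not revo-norm's actual logic, and `reduce_elongation` is a hypothetical name.

```python
import re


def reduce_elongation(text: str) -> str:
    """Collapse stretched letters in informal text."""
    # Rule 1 (assumed): three or more identical letters in a row become one.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Rule 2 (assumed): a doubled letter at the end of a word becomes one.
    text = re.sub(r"([A-Za-z])\1\b", r"\1", text)
    return text


print(reduce_elongation("Sedihhh sangat laa"))
# Sedih sangat la
```

Rule 2 leaves word-internal doubles such as the "aa" in maaf alone, but it would clip English words like "too", so code-mixed input would need a word list or similar safeguard.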

Measurements

Unit abbreviations are expanded to their spoken Malay forms:

normalize_text("Berat 5kg, jarak 10km", language="ms")
# "Berat lima kilogram, jarak sepuluh kilometer"

Decimals and Percentages

normalize_text("Kadar 3.5% setahun", language="ms")
# "Kadar tiga perpuluhan lima peratus setahun"

Hijri Calendar

Islamic/Hijri year conversion is supported:

normalize_text("Tahun Hijri 1446", language="ms")
# Hijri year converted to spoken form

Common Gotchas

DD/MM vs MM/DD Date Ambiguity

The library assumes DD/MM/YYYY format (common in Malaysia and most of the world), not MM/DD/YYYY (US format). This is a heuristic, so dates where both the day and the month are 12 or less are ambiguous: a US-format date such as 05/06/2025 is read as 5 June, not May 6.

# Unambiguous: day > 12
normalize_text("15/08/2025", language="en")
# "fifteenth of August, two thousand and twenty five" (correct)

# Ambiguous: both <= 12
normalize_text("05/06/2025", language="en")
# Treated as 5th of June (DD/MM), not May 6th (MM/DD)

If your source text uses US date format (MM/DD/YYYY), consider pre-processing dates before passing them to revo-norm.
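Such pre-processing can be as small as one regular expression that swaps the first two fields. The helper below, `us_to_day_first`, is a hypothetical name and not part of revo-norm. It assumes every slash-separated date in the input is in US order, so it should only be applied to text known to come from a US-format source.

```python
import re

# MM/DD/YYYY with one or two digits for month and day.
_US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def us_to_day_first(text: str) -> str:
    """Rewrite MM/DD/YYYY dates as DD/MM/YYYY."""

    def _swap(match):
        month, day, year = match.group(1), match.group(2), match.group(3)
        # Leave anything that cannot be a valid US-order date untouched.
        if not (1 <= int(month) <= 12 and 1 <= int(day) <= 31):
            return match.group(0)
        return f"{day}/{month}/{year}"

    return _US_DATE.sub(_swap, text)


print(us_to_day_first("Due 06/05/2025"))
# Due 05/06/2025
```

The converted string can then be passed to normalize_text as usual.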

"RM" in Currency vs Other Meanings

RM is recognized as Ringgit Malaysia when it appears with an amount. In non-currency contexts it is treated like any other all-caps acronym:

# Currency context -- handled correctly
normalize_text("Harga RM 450", language="ms")
# "Harga empat ratus lima puluh ringgit"

# Non-currency context -- split letter by letter as an acronym
normalize_text("The RM team", language="en")
# "The R M team" (split as acronym)

All-Caps Tech Terms (AI, ML, LLM)

All-caps tech terms like AI, ML, LLM, GPU, API are handled by the pronunciation mappings and acronym expansion systems. The pipeline applies pronunciation mappings first (highest priority), then acronym expansion:

# AI, ML, LLM -- split letter by letter by expand_acronym()
normalize_text("Train ML models with AI", language="en")
# "Train M L models with A I"

# GUI -- mapped to "gooey" by pronunciation mappings (applied first)
normalize_text("Build a GUI app", language="en")
# "Build a gooey app"

# Custom pronunciation
from revo_norm.pronunciation_mappings import add_custom_mapping
add_custom_mapping("YOLO", "you only live once", "en")
normalize_text("YOLO approach", language="en")
# "you only live once approach"
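The priority order described in this section can be sketched as a lookup that wins over letter splitting. This is a simplified illustration, not the library's pipeline: `PRONUNCIATIONS` and `speak_caps` are hypothetical names, and the single-pass design is an assumption.

```python
import re

# Hypothetical mapping table; the library's own table is much larger.
PRONUNCIATIONS = {"GUI": "gooey"}

# Two or more capital letters standing alone as a word.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def speak_caps(text: str) -> str:
    """Apply pronunciation mappings first, then fall back to letter splitting."""

    def _replace(match):
        word = match.group(0)
        if word in PRONUNCIATIONS:
            return PRONUNCIATIONS[word]   # highest priority
        return " ".join(word)             # "ML" -> "M L"

    return _ACRONYM.sub(_replace, text)


print(speak_caps("Build a GUI app"))          # Build a gooey app
print(speak_caps("Train ML models with AI"))  # Train M L models with A I
```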

Entity Extraction Protects Patterns

The entity extraction system runs early in the pipeline and protects recognized patterns from being mangled by downstream transformations. This prevents issues like:

  • RM 450 being split into R M 450 by acronym expansion (currency is extracted as an entity first)
  • Dates like 15/08/2025 being interpreted as fractions
  • URLs containing numbers from being partially converted

If a specific entity type is causing issues, you can disable it:

normalize_text(text, language="ms", disable=["fractions"])
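The protection idea can be illustrated with placeholders: recognized entities are swapped out before later steps run and swapped back afterwards. The sketch below is a toy model under stated assumptions. `normalize_with_protection` is a hypothetical name, only RM amounts are protected, and the entity is restored verbatim, whereas the real pipeline would convert it to words.

```python
import re

_CURRENCY = re.compile(r"\bRM\s?\d+(?:\.\d+)?\b")
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
_PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def normalize_with_protection(text: str) -> str:
    """Expand acronyms while keeping currency amounts intact."""
    protected = []

    def _stash(match):
        # Replace the entity with an index wrapped in NUL characters.
        protected.append(match.group(0))
        return f"\x00{len(protected) - 1}\x00"

    # Step 1: extract entities early.
    masked = _CURRENCY.sub(_stash, text)
    # Step 2: downstream transformation that would otherwise split "RM".
    expanded = _ACRONYM.sub(lambda m: " ".join(m.group(0)), masked)
    # Step 3: put the protected entities back.
    return _PLACEHOLDER.sub(lambda m: protected[int(m.group(1))], expanded)


print(normalize_with_protection("Harga RM 450 untuk GPU"))
# Harga RM 450 untuk G P U
```

Without step 1, the acronym rule would turn the amount into "R M 450", which is the failure mode listed above.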