Skip to content

Quickstart Guide

Get started with revo-norm in under 5 minutes.

Basic Usage

The library exposes a single entry point — normalize_text() — that handles the entire pipeline.

from revo_norm import normalize_text

Pass your text and specify the language ("en" for English, "ms" for Malay):

# English
result = normalize_text("The meeting is at 3:30 pm on 15/08/2025", language="en")
print(result)
# "The meeting is at three thirty p m on fifteen of August twenty twenty-five"

# Malay
result = normalize_text("RM450 untuk projek ini", language="ms")
print(result)
# "empat ratus lima puluh ringgit untuk projek ini"

Note

Malay text often mixes in English terms (code-mixing). Revo-norm handles this naturally — English terms in Malay sentences are preserved unless a specific normalization rule applies.

Configuration Profiles

Revo-norm ships with four preset profiles that control how aggressively text is normalized.

# Use a profile via the profile parameter
normalize_text(text, language="en", profile="standard")

Profile Comparison

Feature minimal basic standard aggressive
Spacing cleanup Yes Yes Yes Yes
Acronyms Yes Yes Yes
Abbreviations Yes Yes Yes
Elongated words Yes Yes Yes
Malay-local features Yes Yes Yes
Special characters Yes Yes Yes
Numbers Yes Yes
Dates Yes Yes
Times Yes Yes
Temperature Yes Yes
Fractions Yes Yes
Measurements Yes Yes
IC numbers Yes Yes
Hijri years Yes Yes
Hari bulan Yes Yes
X-kali Yes Yes
Pronunciation overrides Yes Yes
Strip [...] content Yes

When no profile is specified, standard is used (all features enabled).

When to Use Each Profile

minimal

Best for text that is already mostly clean. Only collapses extra whitespace. Use when you need raw text with no transformations.

basic

Good for conversational text where you want acronym expansion and basic cleanup, but don't need number-to-words or entity normalization.

standard (default)

The right choice for most TTS applications. Full normalization with entity protection for dates, currency, and URLs.

aggressive

Everything in standard plus strips all bracketed content like [laughter], [music], [applause]. Best for social media or noisy transcripts where [...] annotations are common.

Disabling Specific Features

Use the disable parameter to turn off individual features without changing the overall profile.

# Disable acronym expansion
normalize_text("The API is fast", language="en", disable=["acronyms"])
# "The API is fast"  (acronym preserved instead of expanded to "A P I")

# Disable multiple features
normalize_text(
    "Temperature is 25C and 3/4 done",
    language="en",
    disable=["temperature", "fractions"],
)
# Temperature and fractions left as-is

Available Feature Names

These are the feature names you can pass in the disable list:

Feature Name Controls
acronyms Acronym expansion (IBM → "I B M")
abbreviations Abbreviation expansion
spacing Whitespace normalization
measurements Measurement units (5km → "five kilometers")
dates Date recognition and conversion
times Time recognition and conversion
temperature Temperature normalization (25C → "twenty-five celsius")
fractions Fraction normalization (3/4 → "three over four")
x_kali Malay x-kali multiplier
ic Malaysian IC number normalization
hari_bulan Malay hari bulan format
hijri Hijri year conversion
elongated Elongated word normalization (soooo → "so")
malay_local Malay-local features
special_chars Special character replacement (& → "and", % → "percent")
pronunciation_overrides Built-in pronunciation overrides

Common Examples

Currency

# English — K-suffix expansion
normalize_text("The budget is RM50K", language="en")
# "The budget is fifty thousand ringgit"

# Malay — full spoken form
normalize_text("Harga RM450", language="ms")
# "Harga empat ratus lima puluh ringgit"

# Large suffixes (M = million, B = billion, T = trillion)
normalize_text("Revenue hit $1.5M", language="en")
# "Revenue hit one million, five hundred thousand dollar"

Dates

# DD/MM/YYYY format
normalize_text("Due on 15/08/2025", language="en")
# "Due on fifteen of August twenty twenty-five"

# ISO format (YYYY-MM-DD)
normalize_text("Deadline: 2025-08-15", language="en")
# "Deadline: August the fifteenth, twenty twenty-five"

Warning

Date parsing uses heuristics for ambiguous formats (e.g., 01/02/2025 could be DD/MM or MM/DD). The parser assumes DD/MM ordering. If your input uses MM/DD, results may differ.

Times

# With meridiem
normalize_text("Meeting at 3:30 pm", language="en")
# "Meeting at three thirty p m"

# 24-hour format
normalize_text("Departs at 14:00", language="en")
# "Departs at fourteen zero"

Acronyms

# Letter-by-letter splitting
normalize_text("IBM announced results", language="en")
# "I B M announced results"

# Pronounceable acronyms
normalize_text("Parse JSON data", language="en")
# "Parse J son data"

# Preserved acronyms (well-known words)
normalize_text("NASA launched", language="en")
# "NASA launched"

Temperature

# Celsius
normalize_text("It is 25C outside", language="en")
# "It is twenty-five celsius outside"

# Fahrenheit
normalize_text("Temperature is 98F", language="en")
# "Temperature is ninety-eight fahrenheit"

Fractions

# Common fractions
normalize_text("Pour 3/4 cup of water", language="en")
# "Pour three over four cup of water"

# Other fractions
normalize_text("Use 1/2 teaspoon", language="en")
# "Use one over two teaspoon"

Measurements

# Distance
normalize_text("Run 5km every day", language="en")
# "Run five kilometers every day"

# Weight
normalize_text("Buy 10kg of rice", language="en")
# "Buy ten kilogram of rice"

# Data
normalize_text("Download 3GB file", language="en")
# "Download three gigabyte file"

Malay-Specific

# Currency in Malay
normalize_text("RM450 untuk projek ini", language="ms")
# "empat ratus lima puluh ringgit untuk projek ini"

# X-kali multiplier (Malay)
normalize_text("5x ganda", language="ms")
# "lima kali ganda"

# Temperature in Malay
normalize_text("Suhu 25C hari ini", language="ms")
# "Suhu dua puluh lima celcius hari ini"

Custom Pronunciation Mappings

Add your own pronunciation overrides. These take the highest priority in the pipeline and are applied before any other transformation.

from revo_norm import normalize_text, add_custom_mapping

# Add a custom mapping
add_custom_mapping("YOLO", "you only live once", "en")

normalize_text("YOLO approach", language="en")
# "you only live once approach"

Warning

add_custom_mapping() modifies module-level state and is not thread-safe. Add all custom mappings at application startup, before any calls to normalize_text().

Next Steps

  • API Reference — full function signatures and parameters
  • Features — detailed guides for each normalization feature