Skip to content

Measurement Normalization

Overview

Measurement normalization converts measurement expressions (distance, volume, weight, duration, area) into their spoken form for TTS. It handles metric and imperial units with bilingual output for English and Malay.

Measurements are normalized before the language-specific normalizer and before acronym expansion to prevent units like "km" from being split into "k m".

Supported Units

Distance

Unit English Malay
km kilometers kilometer
m meters meter
cm centimeters sentimeter
mm millimeters milimeter
mi miles batu
ft feet kaki
in inches inci
yd yards ela
batu miles batu
kaki feet kaki
inci inches inci

Volume

Unit English Malay
ml milliliters mililiter
l liters liter
gal gallons gelen

Weight

Unit English Malay
kg kilograms kilogram
g grams gram
mg milligrams miligram
lb pounds paun
oz ounces auns

Duration

Unit English Malay
hour / hours hour / hours jam
minute / minutes minute / minutes minit
second / seconds second / second saat
jam hours jam
minit minutes minit
saat seconds saat

Area

Unit English Malay
sq ft / sqft square feet kaki persegi

Examples

Distance

from revo_norm import normalize_text

# English
normalize_text("5km away", language="en")
# "five kilometers away"

normalize_text("100m sprint", language="en")
# "one hundred meters sprint"

normalize_text("30mi drive", language="en")
# "thirty miles drive"

# Malay
normalize_text("5km jauh", language="ms")
# "lima kilometer jauh"

normalize_text("100m lari", language="ms")
# "seratus meter lari"

Volume

# English
normalize_text("500ml bottle", language="en")
# "five hundred milliliters bottle"

normalize_text("2l jug", language="en")
# "two liters jug"

# Malay
normalize_text("500ml botol", language="ms")
# "lima ratus mililiter botol"

Weight

# English
normalize_text("75kg person", language="en")
# "seventy five kilograms person"

normalize_text("500g flour", language="en")
# "five hundred grams flour"

# Malay
normalize_text("75kg orang", language="ms")
# "tujuh puluh lima kilogram orang"

normalize_text("500g tepung", language="ms")
# "lima ratus gram tepung"

Duration

# English
normalize_text("5 hours to complete", language="en")
# "five hours to complete"

normalize_text("30 minutes wait", language="en")
# "thirty minutes wait"

# Malay
normalize_text("5 jam untuk siap", language="ms")
# "lima jam untuk siap"

normalize_text("30 minit tunggu", language="ms")
# "tiga puluh minit tunggu"

Area

# English
normalize_text("1000 sq ft apartment", language="en")
# "one thousand square feet apartment"

# Malay
normalize_text("1000 sq ft pangsapuri", language="ms")
# "seribu kaki persegi pangsapuri"

Decimal Values

Measurements support decimal values. Commas in numbers are converted to decimal points.

normalize_text("1,5km away", language="en")
# "one point five kilometers away"

normalize_text("2.5kg of rice", language="en")
# "two point five kilograms of rice"

How to Disable

from revo_norm import normalize_text

# Disable measurement normalization
result = normalize_text("5km away", language="en", disable=["measurements"])
# "five k m away"  (number normalized, but unit split by acronym expander)

# Use minimal profile (measurements not normalized)
result = normalize_text("5km away", language="en", profile="minimal")

When measurements are disabled, the unit abbreviation may still be split by the acronym expander in a subsequent step (e.g., "km" becomes "k m").

Edge Cases

  • Case insensitivity: Both 5KM and 5km are recognized.
  • Space between number and unit: Both 5km and 5 km are handled.
  • Negative values: -5C is handled (primarily for temperature, but the measurement pattern also supports negative values).
  • Duration English/Malay cross-use: Malay duration words (jam, minit, saat) are recognized in both English and Malay contexts.
  • No area unit for sq m: Currently only sq ft / sqft is supported for area. Square meters and other area units are not handled.
  • Unit after number-to-words: Measurement normalization runs before the language normalizer, so the numeric value is still in digit form when matched. The language normalizer then converts the spoken number to words.