Changelog¶
All notable changes to revo-norm are documented here.
v0.2.0-dev (Current)¶
Single unified pipeline architecture with entity extraction always enabled.
Added¶
- Single pipeline architecture: Entity extraction is always enabled, replacing the dual-pipeline (legacy vs entity extraction) approach
profile=parameter: Preset configurations (minimal,basic,standard,aggressive) passed directly tonormalize_text()disable=parameter: Disable specific features by name instead of individual boolean flagsConfigdataclass: Simple feature-toggle configuration replacingNormalizationConfig,FeatureGroup,FeatureLevel, andProfileenums- Pronunciation mappings system: Explicit mappings applied first in pipeline (
GUI->gooey,ASCII->as key,IEEE->I triple E) - Custom pronunciation mappings:
add_custom_mapping()API for user-defined overrides - Currency entity extraction:
RM,$,EUR,GBPamounts protected from downstream acronym/abbreviation expansion - Currency K/M/B/T suffix expansion:
RM30K->RM30000,RM1.5M->RM1500000 - Date recognition:
15/08/2025-> spoken date form (DD/MM/YYYY and YYYY-MM-DD formats) - Time recognition:
3:30 pm-> spoken time form - Malay-specific features: IC numbers, hari bulan, hijri years, x-kali multipliers, elongated word normalization
- Entity types: URL, email, date, time, currency, fraction, temperature, IC, hari_bulan, hijri, x_kali
Changed¶
- Entity extraction runs automatically (no
extract_entities_firstflag needed) - Acronym expansion merged into
replace_letter_period_sequences() - 96 acronyms removed from abbreviation list to avoid conflicts with acronym expansion
- Single-letter and short uppercase abbreviation expansion disabled to prevent breaking domain terms
- Pre-compiled regex patterns at module level for performance
Deprecated¶
- Boolean flags in
normalize_text()(e.g.,normalize_temperature_flag=False) -- usedisable=["temperature"] NormalizationConfigclass -- useConfigFeatureGroup,FeatureLevel,Profileenums -- use plain stringsminimal_config(),basic_config(),standard_config(),aggressive_config()-- useConfig.from_profile()orprofile=parameterconfig.with_feature()-- set attributes directly:cfg.acronyms = Falseconfig.with_sound_words()-- setcfg.sound_wordsdirectly
Fixed¶
- Date/fraction pattern conflict (entity extraction prevents false fraction matches on dates)
RMcurrency split to "R M" by acronym expansion (entity extraction protects currency)JSON/ML/AIunwanted splitting by pronunciation mappings and acronym rules- Double pronunciation override call
- Hari bulan placeholder robustness (
__HARI_BULAN__) - Hyphens between words replaced with spaces before acronym expansion
v0.1.0¶
Initial release.
- Basic text normalization for English and Malay
- Number-to-words conversion
- Contraction expansion (English)
- Abbreviation expansion
- Acronym expansion
- URL and email to spoken form
- Currency normalization
- Temperature normalization
- Fraction normalization
- Measurement normalization