Entity Extraction¶

The entity extraction system protects structured patterns (URLs, emails, dates, etc.) from being mangled by other normalization steps.

How It Works¶

The system uses a three-phase approach:

Extract — Scan text for entity patterns, replace each match with a <<<TYPE_ID>>> placeholder
Process — Apply normalization steps to the remaining text (placeholders are untouched)
Restore — Replace placeholders with the spoken form of each entity

This prevents cascading transformations. For example, without entity extraction, "RM 450000" could be transformed as:

"RM" → "R M" (acronym expansion) → "R meter" (abbreviation expansion)

With entity extraction, "RM 450000" is replaced with <<<CURRENCY_1>>> before any other processing, then restored as "empat ratus lima puluh ribu ringgit" at the end.

`EntityExtractor`¶

`EntityExtractor` ¶

Extracts entities from text and protects them with placeholders.

This allows safe processing of the rest of the text without interfering with entity patterns.

`extract(text, enabled_entities=None)` ¶

Extract all entities from text and replace with placeholders.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to extract entities from	required
`enabled_entities`	`Optional[list[EntityType]]`	List of entity types to extract (None = all)	`None`

Returns:

Type	Description
`tuple[str, list[Entity]]`	Tuple of (protected_text, entities_list)

Example

extractor = EntityExtractor() protected, entities = extractor.extract("On 15/08/2025") print(protected) "On DATE_1" print(entities[0].text) "15/08/2025"

`restore(text, language)` ¶

Convert entity placeholders back to spoken form.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text with entity placeholders	required
`language`	`str`	Target language ('en' or 'ms')	required

Returns:

Type	Description
`str`	Text with entities converted to spoken form

Example

extractor = EntityExtractor() protected, entities = extractor.extract("15/08/2025") processed = process_text(protected) result = extractor.restore(processed, "en") print(result) "fifteenth of August two thousand and twenty-five"

Constructor¶

from revo_norm.entity_extractor import EntityExtractor

extractor = EntityExtractor()

Creates a new extractor instance with pre-compiled patterns for all entity types.

`extract(text, enabled_entities)`¶

`extract(text, enabled_entities=None)` ¶

Extract all entities from text and replace with placeholders.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to extract entities from	required
`enabled_entities`	`Optional[list[EntityType]]`	List of entity types to extract (None = all)	`None`

Returns:

Type	Description
`tuple[str, list[Entity]]`	Tuple of (protected_text, entities_list)

Example

extractor = EntityExtractor() protected, entities = extractor.extract("On 15/08/2025") print(protected) "On DATE_1" print(entities[0].text) "15/08/2025"

Extract entities from text and replace them with placeholders.

Parameters:

Name	Type	Default	Description
`text`	`str`	required	Input text to extract entities from
`enabled_entities`	`list[EntityType] \| None`	`None`	List of entity types to extract. `None` extracts all types.

Returns: tuple[str, list[Entity]] — A tuple of:

str — The protected text with entities replaced by <<<TYPE_ID>>> placeholders
list[Entity] — The list of extracted Entity objects, sorted by position

Placeholder format: <<<TYPE_ID>>>

TYPE is the uppercase entity type name (e.g., URL, DATE, CURRENCY)
ID is a sequential integer starting from 1

Example placeholders: <<<URL_1>>>, <<<DATE_2>>>, <<<CURRENCY_3>>>

extractor = EntityExtractor()

protected, entities = extractor.extract(
    "Send email to user@example.com about RM500 on 15/08/2025"
)

# protected: "Send email to <<<EMAIL_1>>> about <<<CURRENCY_2>>> on <<<DATE_3>>>"
# entities contains 3 Entity objects with original text and positions

`restore(text, language)`¶

`restore(text, language)` ¶

Convert entity placeholders back to spoken form.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text with entity placeholders	required
`language`	`str`	Target language ('en' or 'ms')	required

Returns:

Type	Description
`str`	Text with entities converted to spoken form

Example

extractor = EntityExtractor() protected, entities = extractor.extract("15/08/2025") processed = process_text(protected) result = extractor.restore(processed, "en") print(result) "fifteenth of August two thousand and twenty-five"

Convert entity placeholders back to spoken form.

Parameters:

Name	Type	Default	Description
`text`	`str`	required	Text with entity placeholders
`language`	`str`	required	Target language (`"en"` or `"ms"`)

Returns: str — Text with all entity placeholders replaced by their spoken form.

extractor = EntityExtractor()

protected, _ = extractor.extract("Price is RM500")
# protected: "Price is <<<CURRENCY_1>>>"

# ... other processing on protected text ...

result = extractor.restore(protected, "ms")
# result: "Price is lima ratus ringgit"

`entities` attribute¶

list[Entity] — The list of entities found during the most recent extract() call. This list is cleared and repopulated each time extract() is called.

`EntityType`¶

`EntityType` ¶

Bases: str, Enum

Types of entities that can be extracted.

Enum of all entity types that can be extracted. Each value is a lowercase string.

Type	Value	What It Detects
`URL`	`"url"`	HTTP/HTTPS/FTP URLs, IP addresses with ports, domain names
`EMAIL`	`"email"`	Email addresses (`user@example.com`)
`DATE`	`"date"`	DD/MM/YYYY, YYYY-MM-DD, DD Month YYYY, Month DD, YYYY
`TIME`	`"time"`	HH:MM, HH:MM AM/PM, HH:MM:SS
`CURRENCY`	`"currency"`	RM, USD, EUR, GBP, MYR, $, £, € with amounts and optional K/M/B/T suffixes
`FRACTION`	`"fraction"`	Numeric fractions (3/4, 10/4) — excludes date-like patterns
`TEMPERATURE`	`"temperature"`	Temperature values (25C, -5F, 100K, 37.5C)
`IC`	`"ic"`	Malaysian IC numbers (YYMMDD-SS-NNNN)
`HARI_BULAN`	`"hari_bulan"`	Malay day-of-month format (e.g., `15hb`)
`HIJRI`	`"hijri"`	Hijri years (e.g., `1446H`)
`X_KALI`	`"x_kali"`	Multipliers (e.g., `5x`, `10X`)

from revo_norm.entity_extractor import EntityType

# Access entity type values
EntityType.URL       # <EntityType.URL: 'url'>
EntityType.CURRENCY  # <EntityType.CURRENCY: 'currency'>
EntityType.DATE      # <EntityType.DATE: 'date'>

`Entity`¶

`Entity` `dataclass` ¶

Represents an extracted entity from text.

Attributes:

Name	Type	Description
`type`	`EntityType`	Entity type (URL, email, date, etc.)
`text`	`str`	Original matched text
`start`	`int`	Start position in original text
`end`	`int`	End position in original text
`metadata`	`dict`	Additional information (language, format, etc.)
`placeholder_id`	`int`	ID for placeholder generation

Dataclass representing an extracted entity.

Field	Type	Description
`type`	`EntityType`	The entity type (URL, EMAIL, DATE, etc.)
`text`	`str`	The original matched text from the input
`start`	`int`	Start position in the original text
`end`	`int`	End position in the original text
`metadata`	`dict`	Additional information (language, format, etc.) — defaults to `{}`
`placeholder_id`	`int`	ID used for placeholder generation — defaults to `0`

from revo_norm.entity_extractor import Entity, EntityType

entity = Entity(
    type=EntityType.URL,
    text="https://example.com",
    start=5,
    end=24,
    placeholder_id=1,
)

entity.type           # EntityType.URL
entity.text           # "https://example.com"
entity.placeholder_id # 1

Extraction Order¶

Entities are extracted in the following order (most specific first to prevent false matches):

URL — Must be first to prevent IP addresses from being matched by other patterns
EMAIL — Must be early to prevent @ from being replaced by special chars
CURRENCY — Must be early to protect currency codes from acronym/abbreviation expansion
DATE — Protects slash-format dates from fraction pattern
TIME — Protects colon-format times from other processing
TEMPERATURE — Matches 25C, -5F, etc.
FRACTION — Matches 3/4, 10/4 (with negative lookahead to skip dates)
X_KALI — Matches 5x, 10X
IC — Matches Malaysian IC format YYMMDD-SS-NNNN
HARI_BULAN — Matches 15hb, 3HB
HIJRI — Matches 1446H, 1445h

Entity Extraction¶

How It Works¶

EntityExtractor¶

EntityExtractor ¶

extract(text, enabled_entities=None) ¶

restore(text, language) ¶

Constructor¶

extract(text, enabled_entities)¶

extract(text, enabled_entities=None) ¶

restore(text, language)¶

restore(text, language) ¶

entities attribute¶

EntityType¶

EntityType ¶

Entity¶

Entity dataclass ¶

Extraction Order¶

`EntityExtractor`¶

`EntityExtractor` ¶

`extract(text, enabled_entities=None)` ¶

`restore(text, language)` ¶

`extract(text, enabled_entities)`¶

`extract(text, enabled_entities=None)` ¶

`restore(text, language)`¶

`restore(text, language)` ¶

`entities` attribute¶

`EntityType`¶

`EntityType` ¶

`Entity`¶

`Entity` `dataclass` ¶