Entity Extraction
The entity extraction system protects structured patterns (URLs, emails, dates, etc.) from being mangled by other normalization steps.
How It Works
The system uses a three-phase approach:
- Extract: Scan text for entity patterns and replace each match with a <<<TYPE_ID>>> placeholder
- Process: Apply normalization steps to the remaining text (placeholders are untouched)
- Restore: Replace placeholders with the spoken form of each entity
This prevents cascading transformations. For example, without entity extraction, "RM 450000" could be transformed as:
"RM" → "R M" (acronym expansion) → "R meter" (abbreviation expansion)
With entity extraction, "RM 450000" is replaced with <<<CURRENCY_1>>> before any other processing, then restored as "empat ratus lima puluh ribu ringgit" at the end.
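The three phases can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the single `CURRENCY_RE` pattern, the `expand_acronyms` step, and the `speak` callback are all hypothetical stand-ins chosen to reproduce the "RM" problem described above.

```python
import re

# Hypothetical single-entity pattern: "RM" followed by an integer amount.
CURRENCY_RE = re.compile(r"\bRM\s?(\d+)\b")


def extract(text):
    """Phase 1: swap each currency match for a <<<CURRENCY_n>>> placeholder."""
    found = []

    def to_placeholder(match):
        found.append(match.group(0))
        return f"<<<CURRENCY_{len(found)}>>>"

    return CURRENCY_RE.sub(to_placeholder, text), found


def expand_acronyms(text):
    """Phase 2 stand-in: a naive step that would corrupt an unprotected "RM"."""
    return re.sub(r"\bRM\b", "R M", text)


def restore(text, found, speak):
    """Phase 3: replace each placeholder with the spoken form of its entity."""
    def to_spoken(match):
        return speak(found[int(match.group(1)) - 1])

    return re.sub(r"<<<CURRENCY_(\d+)>>>", to_spoken, text)


def run(text):
    protected, found = extract(text)
    processed = expand_acronyms(protected)
    # `speak` is a dummy verbalizer; a real one would spell out the amount.
    return restore(processed, found, speak=lambda raw: f"[spoken:{raw}]")
```

Calling `expand_acronyms` directly on the raw text splits "RM" into "R M", while `run` leaves the amount intact because the acronym step only ever sees the placeholder.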
EntityExtractor
Extracts entities from text and protects them with placeholders.
This allows safe processing of the rest of the text without interfering with entity patterns.
extract(text, enabled_entities=None)
Extract all entities from text and replace them with placeholders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to extract entities from | required |
| enabled_entities | Optional[list[EntityType]] | List of entity types to extract (None = all) | None |
Returns:
| Type | Description |
|---|---|
| tuple[str, list[Entity]] | Tuple of (protected_text, entities_list) |
Example
extractor = EntityExtractor()
protected, entities = extractor.extract("On 15/08/2025")
print(protected)  # On <<<DATE_1>>>
print(entities[0].text)  # 15/08/2025
restore(text, language)
Convert entity placeholders back to spoken form.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text with entity placeholders | required |
| language | str | Target language ('en' or 'ms') | required |
Returns:
| Type | Description |
|---|---|
| str | Text with entities converted to spoken form |
Example
extractor = EntityExtractor()
protected, entities = extractor.extract("15/08/2025")
# process_text stands in for the other normalization steps applied to the protected text
processed = process_text(protected)
result = extractor.restore(processed, "en")
print(result)  # fifteenth of August two thousand and twenty-five
Constructor
Creates a new extractor instance with pre-compiled patterns for all entity types.
extract(text, enabled_entities)
Extract entities from text and replace them with placeholders.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| text | str | required | Input text to extract entities from |
| enabled_entities | list[EntityType] \| None | None | List of entity types to extract. None extracts all types. |
Returns: tuple[str, list[Entity]], a tuple of:
- str: The protected text with entities replaced by <<<TYPE_ID>>> placeholders
- list[Entity]: The list of extracted Entity objects, sorted by position
Placeholder format: <<<TYPE_ID>>>
- TYPE is the uppercase entity type name (e.g., URL, DATE, CURRENCY)
- ID is a sequential integer starting from 1
Example placeholders: <<<URL_1>>>, <<<DATE_2>>>, <<<CURRENCY_3>>>
# Assumes EntityExtractor lives alongside EntityType in revo_norm.entity_extractor
from revo_norm.entity_extractor import EntityExtractor
extractor = EntityExtractor()
protected, entities = extractor.extract(
    "Send email to user@example.com about RM500 on 15/08/2025"
)
# protected: "Send email to <<<EMAIL_1>>> about <<<CURRENCY_2>>> on <<<DATE_3>>>"
# entities contains 3 Entity objects with original text and positions
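The placeholder format described above can be built and parsed with a small regular expression. The helper names `make_placeholder` and `find_placeholders` below are hypothetical and are not part of the library; the sketch only assumes the documented `<<<TYPE_ID>>>` shape.

```python
import re

# The type group is greedy, so multi-word names such as HARI_BULAN
# are split from the ID on the last underscore.
PLACEHOLDER_RE = re.compile(r"<<<([A-Z_]+)_(\d+)>>>")


def make_placeholder(type_name, placeholder_id):
    """Build a placeholder such as <<<DATE_2>>> from a type name and an ID."""
    return f"<<<{type_name.upper()}_{placeholder_id}>>>"


def find_placeholders(text):
    """Return (type_name, id) pairs for every placeholder in text, in order."""
    return [(m.group(1), int(m.group(2))) for m in PLACEHOLDER_RE.finditer(text)]
```

A helper like `find_placeholders` can be handy in tests for intermediate normalization steps, to confirm that no step has altered or dropped a placeholder before `restore()` runs.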
restore(text, language)
Convert entity placeholders back to spoken form.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| text | str | required | Text with entity placeholders |
| language | str | required | Target language ("en" or "ms") |
Returns: str — Text with all entity placeholders replaced by their spoken form.
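# Assumes EntityExtractor lives alongside EntityType in revo_norm.entity_extractor
from revo_norm.entity_extractor import EntityExtractor
extractor = EntityExtractor()
protected, _ = extractor.extract("Price is RM500")
# protected: "Price is <<<CURRENCY_1>>>"
# ... other processing on protected text ...
result = extractor.restore(protected, "ms")
# result: "Price is lima ratus ringgit"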
entities attribute
list[Entity] — The list of entities found during the most recent extract() call. This list is cleared and repopulated each time extract() is called.
EntityType
Bases: str, Enum
Enum of all entity types that can be extracted. Each value is a lowercase string.
| Type | Value | What It Detects |
|---|---|---|
| URL | "url" | HTTP/HTTPS/FTP URLs, IP addresses with ports, domain names |
| EMAIL | "email" | Email addresses (user@example.com) |
| DATE | "date" | DD/MM/YYYY, YYYY-MM-DD, DD Month YYYY, Month DD, YYYY |
| TIME | "time" | HH:MM, HH:MM AM/PM, HH:MM:SS |
| CURRENCY | "currency" | RM, USD, EUR, GBP, MYR, $, £, € with amounts and optional K/M/B/T suffixes |
| FRACTION | "fraction" | Numeric fractions (3/4, 10/4), excluding date-like patterns |
| TEMPERATURE | "temperature" | Temperature values (25C, -5F, 100K, 37.5C) |
| IC | "ic" | Malaysian IC numbers (YYMMDD-SS-NNNN) |
| HARI_BULAN | "hari_bulan" | Malay day-of-month format (e.g., 15hb) |
| HIJRI | "hijri" | Hijri years (e.g., 1446H) |
| X_KALI | "x_kali" | Multipliers (e.g., 5x, 10X) |
from revo_norm.entity_extractor import EntityType
# Access entity type values
EntityType.URL # <EntityType.URL: 'url'>
EntityType.CURRENCY # <EntityType.CURRENCY: 'currency'>
EntityType.DATE # <EntityType.DATE: 'date'>
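Because the enum mixes in `str`, its members behave like their lowercase string values. The sketch below uses a hypothetical two-member `DemoEntityType` that mirrors the documented base classes, so it runs without the library; `parse_entity_types` is an illustrative helper and not part of the documented API.

```python
from enum import Enum


class DemoEntityType(str, Enum):
    """Hypothetical two-member mirror of EntityType, for illustration only."""

    URL = "url"
    HARI_BULAN = "hari_bulan"


def parse_entity_types(values):
    """Convert plain strings (e.g. from a config file) into enum members."""
    return [DemoEntityType(value) for value in values]


# Members compare equal to their plain string values.
assert DemoEntityType.URL == "url"
# .name is the uppercase identifier; .value is the lowercase string.
assert DemoEntityType.HARI_BULAN.name == "HARI_BULAN"
assert DemoEntityType.HARI_BULAN.value == "hari_bulan"
```

Under the same assumption, a list produced this way from configuration strings would have the shape that `enabled_entities` expects.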
Entity
Represents an extracted entity from text.
Attributes:
| Name | Type | Description |
|---|---|---|
| type | EntityType | Entity type (URL, email, date, etc.) |
| text | str | Original matched text |
| start | int | Start position in original text |
| end | int | End position in original text |
| metadata | dict | Additional information (language, format, etc.) |
| placeholder_id | int | ID for placeholder generation |
Dataclass representing an extracted entity.
| Field | Type | Description |
|---|---|---|
| type | EntityType | The entity type (URL, EMAIL, DATE, etc.) |
| text | str | The original matched text from the input |
| start | int | Start position in the original text |
| end | int | End position in the original text |
| metadata | dict | Additional information (language, format, etc.); defaults to {} |
| placeholder_id | int | ID used for placeholder generation; defaults to 0 |
from revo_norm.entity_extractor import Entity, EntityType
entity = Entity(
type=EntityType.URL,
text="https://example.com",
start=5,
end=24,
placeholder_id=1,
)
entity.type # EntityType.URL
entity.text # "https://example.com"
entity.placeholder_id # 1
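In the example above, `end - start` equals the length of the matched text, which suggests that `start` and `end` form a half-open span usable as a Python slice. The sketch below checks that reading with a hypothetical `DemoEntity` dataclass that mirrors the documented fields and defaults; it is an assumption about the span convention, not a statement about the library's internals.

```python
from dataclasses import dataclass, field


@dataclass
class DemoEntity:
    """Hypothetical stand-in mirroring the documented Entity fields."""

    type: str
    text: str
    start: int
    end: int
    # A mutable default such as {} must be supplied through default_factory.
    metadata: dict = field(default_factory=dict)
    placeholder_id: int = 0


def span_matches(source, entity):
    """Check that source[start:end] reproduces the entity's original text."""
    return source[entity.start:entity.end] == entity.text


source = "Open https://example.com now"
entity = DemoEntity(
    type="url",
    text="https://example.com",
    start=5,
    end=24,
    placeholder_id=1,
)
```

A check like `span_matches` is a cheap sanity test when consuming extracted entities, since it fails as soon as the positions and the source text drift apart.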
Extraction Order
Entities are extracted in the following order (most specific first to prevent false matches):
- URL: Must be first to prevent IP addresses from being matched by other patterns
- EMAIL: Must be early to prevent @ from being replaced by special chars
- CURRENCY: Must be early to protect currency codes from acronym/abbreviation expansion
- DATE: Protects slash-format dates from fraction pattern
- TIME: Protects colon-format times from other processing
- TEMPERATURE: Matches 25C, -5F, etc.
- FRACTION: Matches 3/4, 10/4 (with negative lookahead to skip dates)
- X_KALI: Matches 5x, 10X
- IC: Matches Malaysian IC format YYMMDD-SS-NNNN
- HARI_BULAN: Matches 15hb, 3HB
- HIJRI: Matches 1446H, 1445h
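The effect of the ordering can be seen with two deliberately simplified patterns. This is a sketch under stated assumptions: `ORDERED_PATTERNS` and `extract_in_order` are hypothetical, the regexes are far cruder than real ones would need to be, and the ID numbering here follows extraction order, which may differ from the library's.

```python
import re

# Hypothetical, simplified patterns listed in extraction order.
ORDERED_PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("FRACTION", re.compile(r"\b\d+/\d+\b")),
]


def extract_in_order(text, patterns):
    """Apply each pattern in turn, replacing matches with <<<TYPE_n>>> placeholders."""
    found = []

    for type_name, pattern in patterns:
        # Bind type_name as a default argument so each callback keeps its own type.
        def to_placeholder(match, type_name=type_name):
            found.append((type_name, match.group(0)))
            return f"<<<{type_name}_{len(found)}>>>"

        text = pattern.sub(to_placeholder, text)
    return text, found
```

With DATE first, "15/08/2025" is protected as a whole before the fraction pattern runs. Reversing the list lets the fraction pattern claim "15/08" and leaves a stray "/2025" behind, which is the kind of false match the documented order is meant to prevent.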