text2num is a fast and robust library for recognizing, parsing and transcribing into digits (base 10) numbers expressed in natural language. It works on strings as well as on custom token lists.
No AI involved: it consumes few resources (and little energy!) and has very low latency.
text2num is a Python package that provides functions and parser classes for:
- Parsing numbers expressed as natural language words and converting them to integer values.
- Detecting ordinal, cardinal and decimal numbers in a stream of natural language words and getting their decimal digit representations.
Supported natural languages (in alphabetical order):
- Dutch;
- English;
- French;
- German;
- Italian;
- Portuguese (Brazilian and European);
- Spanish.
Versions 3.X vs 2.X
This new generation of text2num relies on a new and improved algorithm implemented in Rust, whereas the 2.X branch was written in pure Python, used a less capable algorithm and has now been retired.
You don't need Rust to install and run text2num, as we provide precompiled wheels.
Backward incompatible changes:
- dropped support for signed numbers — the feature was broken anyway;
- parsing mode is relaxed by default (i.e. "greedy") — you can use punctuation (e.g. commas) to separate groups in text, or voice pauses if processing Speech-to-Text token streams;
- the optional `threshold` parameter of `alpha2digit` now applies to both ordinals and cardinals. As a consequence the signature of `alpha2digit` has changed;
- the Russian and Catalan languages have not been ported yet.
Installation
We provide precompiled wheels for Linux, macOS and Windows:
| Platform | py 3.8 | py 3.9 | py 3.10 | py 3.11 | py 3.12 | py 3.13 |
|---|---|---|---|---|---|---|
| Linux aarch64 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Linux armv7l | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Linux ppc64le | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Linux s390x | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Linux x86_64 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Linux i686 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| win32 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| win amd64 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| macos 11 | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
So all you need to do is this:
    pip install text2num
    # or, if using a `uv` project:
    uv add text2num
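To confirm which generation ended up in an environment, the installed version can be read with the standard library. This is a minimal sketch that assumes the distribution is named `text2num`, as in the `pip` command above; `installed_generation` is a hypothetical helper and not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_generation(dist_name="text2num"):
    """Return the major version of an installed distribution, or None if absent.

    Hypothetical helper: it only reads package metadata and never imports the library.
    """
    try:
        return int(version(dist_name).split(".")[0])
    except PackageNotFoundError:
        return None


major = installed_generation()
if major is None:
    print("text2num is not installed")
elif major >= 3:
    print(f"text2num {major}.X (Rust-based generation) is installed")
else:
    print(f"text2num {major}.X (retired pure-Python branch) is installed")
```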
Quickstart by example
Not every supported language is covered in these examples, but they should give you an idea.
Parse and convert
French examples:
    >>> from text_to_num import text2num
    >>> text2num('quatre-vingt-quinze', "fr")
    95
    >>> text2num('nonante-cinq', "fr")
    95
    >>> text2num('mille neuf cent quatre-vingt dix-neuf', "fr")
    1999
    >>> text2num('dix-neuf cent quatre-vingt dix-neuf', "fr")
    1999
    >>> text2num("cinquante et un million cinq cent soixante dix-huit mille trois cent deux", "fr")
    51578302
    >>> text2num('mille mille deux cents', "fr")
    Traceback (most recent call last):
        ...
    ValueError: invalid literal for text2num: 'mille mille deux cents'
English examples:
    >>> from text_to_num import text2num
    >>> text2num("fifty-one million five hundred seventy-eight thousand three hundred two", "en")
    51578302
    >>> text2num("eighty-one", "en")
    81
Spanish examples:
    >>> from text_to_num import text2num
    >>> text2num("ochenta y uno", "es")
    81
    >>> text2num("nueve mil novecientos noventa y nueve", "es")
    9999
    >>> text2num("cincuenta y tres millones doscientos cuarenta y tres mil setecientos veinticuatro", "es")
    53243724
Portuguese examples:
    >>> from text_to_num import text2num
    >>> text2num("trinta e dois", "pt")
    32
    >>> text2num("mil novecentos e seis", "pt")
    1906
    >>> text2num("vinte e quatro milhões duzentos mil quarenta e sete", "pt")
    24200047
German examples:
    >>> from text_to_num import text2num
    >>> text2num("einundfünfzigmillionenfünfhundertachtundsiebzigtausenddreihundertzwei", "de")
    51578302
    >>> text2num("ein und achtzig", "de")
    81
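As the French examples show, `text2num` raises `ValueError` on input that is not a valid number. When parsing untrusted text, it can be convenient to turn that into a default value. The wrapper below is only a sketch: `parse_or_default` and its `parser` argument are hypothetical names introduced for illustration, and the import is guarded so the snippet still loads where the package is missing.

```python
try:
    from text_to_num import text2num
except ImportError:  # keep the sketch loadable without the package
    text2num = None


def parse_or_default(text, lang, default=None, parser=text2num):
    """Return the integer value of `text`, or `default` if it is not a valid number.

    `parser` is any callable with the (text, lang) calling convention used above.
    """
    if parser is None:
        raise RuntimeError("text2num is not installed")
    try:
        return parser(text, lang)
    except ValueError:
        return default


if text2num is not None:
    for phrase in ("quatre-vingt-quinze", "mille mille deux cents"):
        print(phrase, "->", parse_or_default(phrase, "fr"))
```

Going by the French session above, the first phrase should come back as 95 and the second as `None`.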
Find and transcribe
Any numbers, even ordinals.
French:
    >>> from text_to_num import alpha2digit
    >>> sentence = (
    ...     "Huit cent quarante-deux pommes, vingt-cinq chiens, mille trois chevaux, "
    ...     "douze mille six cent quatre-vingt-dix-huit clous.\n"
    ...     "Quatre-vingt-quinze vaut nonante-cinq. On tolère l'absence de tirets avant les unités : "
    ...     "soixante seize vaut septante six.\n"
    ...     "Nombres en série : douze, quinze, zéro zéro quatre, vingt, cinquante-deux, cent trois, cinquante deux, "
    ...     "trente et un.\n"
    ...     "Ordinaux: cinquième troisième vingt et unième centième mille deux cent trentième.\n"
    ...     "Décimaux: douze virgule quatre-vingt dix-neuf, cent vingt virgule zéro cinq ; "
    ...     "mais soixante zéro deux."
    ... )
    >>> print(alpha2digit(sentence, "fr"))
    842 pommes, 25 chiens, 1003 chevaux, 12698 clous.
    95 vaut 95. On tolère l'absence de tirets avant les unités : 76 vaut 76.
    Nombres en série : 12, 15, 004, 20, 52, 103, 52, 31.
    Ordinaux: 5ème 3ème 21ème 100ème 1230ème.
    Décimaux: 12,99, 120,05 ; mais 60 02.
    >>> sentence = "Cinquième premier deuxième troisième vingt et unième centième mille deux cent trentième."
    >>> print(alpha2digit(sentence, "fr"))
    5ème 1er 2ème 3ème 21ème 100ème 1230ème.
English:
    >>> from text_to_num import alpha2digit
    >>> text = "On May twenty-third, I bought twenty-five cows, twelve chickens and one hundred twenty five point four zero kg of potatoes."
    >>> alpha2digit(text, "en")
    'On May 23rd, I bought 25 cows, 12 chickens and 125.40 kg of potatoes.'
    >>> alpha2digit("I finished the race in the twelfth position!", "en")
    'I finished the race in the 12th position!'
Spanish:
    >>> from text_to_num import alpha2digit
    >>> text = "Compramos veinticinco vacas, doce gallinas y ciento veinticinco coma cuarenta kg de patatas."
    >>> alpha2digit(text, "es")
    'Compramos 25 vacas, 12 gallinas y 125,40 kg de patatas.'
    >>> text = "Compramos veinticinco vacas, doce gallinas y ciento veinticinco punto cuarenta kg de patatas."
    >>> alpha2digit(text, "es")
    'Compramos 25 vacas, 12 gallinas y 125.40 kg de patatas.'
    >>> text = "Ella ha quedado tercera"
    >>> alpha2digit(text, "es", threshold=0)
    'Ella ha quedado 3ª'
Portuguese:
    >>> from text_to_num import alpha2digit
    >>> text = "Comprámos vinte e cinco vacas, doze galinhas e cento e vinte e cinco vírgula quarenta kg de batatas."
    >>> alpha2digit(text, "pt")
    'Comprámos 25 vacas, 12 galinhas e 125,40 kg de batatas.'
    >>> text = "Ordinais: quinto, terceiro, vigésima, vigésimo primeiro, centésimo quarto"
    >>> alpha2digit(text, "pt")
    'Ordinais: 5º, 3º, 20ª, 21º, 104º'
German:
    >>> from text_to_num import alpha2digit
    >>> text = "Ich habe fünfundzwanzig Kühe, zwölf Hühner und einhundertfünfundzwanzig kg Kartoffeln gekauft."
    >>> alpha2digit(text, "de")
    'Ich habe 25 Kühe, 12 Hühner und 125 kg Kartoffeln gekauft.'
    >>> text = "Die Telefonnummer lautet dreiunddreißig neun sechzig null sechs zwölf einundzwanzig."
    >>> alpha2digit(text, "de")
    'Die Telefonnummer lautet 33 9 60 06 12 21.'
    >>> text = "Der zweiundzwanzigste Januar zweitausendzweiundzwanzig."
    >>> alpha2digit(text, "de")
    'Der 22. Januar 2022.'
    >>> text = "Es ist ein Buch mit dreitausend Seiten aber nicht das erste."
    >>> alpha2digit(text, "de", threshold=0)
    'Es ist 1 Buch mit 3000 Seiten aber nicht das 1..'
    >>> text = "Pi ist drei Komma eins vier und so weiter, aber nicht drei Komma vierzehn :-p"
    >>> alpha2digit(text, "de", threshold=0)
    'Pi ist 3,14 und so weiter, aber nicht 3 Komma 14 :-p'
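Once a sentence has been transcribed, the digits are ordinary text, so downstream code can pick them up with a regular expression. The sketch below uses only the standard library and works on the output strings shown above. `extract_values` is a hypothetical helper, and the decimal separator has to be passed in because it depends on the language (point in the English output, comma in the Spanish one).

```python
import re


def extract_values(transcribed, decimal_sep="."):
    """Return the cardinal and decimal numbers of an already transcribed text as floats.

    Digits glued to letters (ordinals such as '23rd') are skipped by the word boundaries.
    """
    pattern = re.compile(r"\b\d+(?:" + re.escape(decimal_sep) + r"\d+)?\b")
    return [
        float(match.group().replace(decimal_sep, "."))
        for match in pattern.finditer(transcribed)
    ]


english = "On May 23rd, I bought 25 cows, 12 chickens and 125.40 kg of potatoes."
spanish = "Compramos 25 vacas, 12 gallinas y 125,40 kg de patatas."

print(extract_values(english))                   # [25.0, 12.0, 125.4]
print(extract_values(spanish, decimal_sep=","))  # [25.0, 12.0, 125.4]
```

Ordinals written with a trailing period, as in the German output, are not distinguished from cardinals by this pattern and would need separate handling.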
Working with tokens
Imagine that we have an ASR application that returns a transcript as a list of tokens (text, start timestamp, end timestamp) where the timestamps are integers representing milliseconds relative to the beginning of the speech.
    from text_to_num import Token, find_numbers


    class DecodedWord(Token):
        def __init__(self, text, start, end):
            self._text = text
            self.start = start
            self.end = end

        def text(self):
            return self._text

        def nt_separated(self, previous):
            # we consider a voice gap of more than 100 ms as significant
            return self.start - previous.end > 100


    # Let's simulate ASR output
    stream = [
        DecodedWord("We", 0, 100),
        DecodedWord("have", 100, 200),
        DecodedWord("respectively", 200, 400),
        DecodedWord("twenty", 400, 500),
        DecodedWord("nine", 610, 700),
        DecodedWord("and", 700, 800),
        DecodedWord("thirty", 800, 900),
        DecodedWord("four", 950, 1000),
        DecodedWord("dollars", 1010, 1410),
    ]

    occurrences = find_numbers(stream, "en")

    for num in occurrences:
        print(f"found number {num.text} ({num.value}) at range [{num.start}, {num.end}] in the stream")
When executed, that code snippet prints:

    found number 20 (20.0) at range [3, 4] in the stream
    found number 9 (9.0) at range [4, 5] in the stream
    found number 34 (34.0) at range [6, 8] in the stream
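"twenty" and "nine" are reported separately while "thirty four" is merged because of the pause between them: 110 ms in the first case, 50 ms in the second, against the 100 ms limit in `nt_separated`. The standalone sketch below reproduces only that gap rule with plain Python. `Word` and `split_on_gaps` are hypothetical names, and this is not how the library's parser is implemented.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: int  # milliseconds from the beginning of the speech
    end: int    # milliseconds from the beginning of the speech


def split_on_gaps(words, max_gap_ms=100):
    """Group consecutive words, opening a new group when the pause exceeds max_gap_ms."""
    groups = []
    for word in words:
        if groups and word.start - groups[-1][-1].end <= max_gap_ms:
            groups[-1].append(word)
        else:
            groups.append([word])
    return groups


def texts(groups):
    return [[word.text for word in group] for group in groups]


print(texts(split_on_gaps([Word("twenty", 400, 500), Word("nine", 610, 700)])))
# [['twenty'], ['nine']]  (110 ms pause: separated)
print(texts(split_on_gaps([Word("thirty", 800, 900), Word("four", 950, 1000)])))
# [['thirty', 'four']]  (50 ms pause: kept together)
```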
Read the complete documentation on ReadTheDocs.
Contribute
Join us on https://github.com/allo-media/text2num