Automatic Phonemic Transcriber

Version 1.1. Note: Corpora are regularly improved and the transcribers retrained.

An automated phonetic/phonemic transcriber supporting English, German, and Danish. It outputs transcriptions in either the International Phonetic Alphabet (IPA) or SAMPA, an alphabet designed for speech recognition technology.

DANISH CORPUS (unique word forms): 39310
Training/test 100%/100%: words correct: 37240, words incorrect: 2070 = 94.73 % correct
Phoneme Error Rate (PHER) = (S+D+I)/N = (668+1326+280)/280912 = 0.00810
Training/test 90%/10%: words correct: 2841, words incorrect: 1090 = 72.27 % correct
Phoneme Error Rate (PHER) = (S+D+I)/N = (232+1119+169)/28152 = 0.05399
Training/test 80%/20%: words correct: 5723, words incorrect: 2139 = 72.79 % correct
Phoneme Error Rate (PHER) = (S+D+I)/N = (448+2205+332)/56194 = 0.05312

S is the number of phoneme substitutions,
D is the number of phoneme deletions,
I is the number of phoneme insertions,
N is the number of phonemes in the reference (test set)
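
As an illustration of how these counts can be obtained, the sketch below aligns a predicted phoneme sequence with its reference by minimum edit distance and tallies S, D, and I. It is a minimal sketch, not the scoring code used by the tool: the function names `edit_counts` and `pher` are hypothetical, and tie-breaking between equally cheap alignments is an assumption that may differ from the original evaluation.

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (S, D, I) for a minimum-edit alignment of hyp against ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_edits, S, D, I) for ref[:i] versus hyp[:j].
    table = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # every reference phoneme deleted
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # every predicted phoneme inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, ins)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = table[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = table[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Tuples compare by total edits first, so the minimum is optimal.
            table[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = table[-1][-1]
    return s, d, ins


def pher(s: int, d: int, i: int, n: int) -> float:
    """Phoneme Error Rate: (S + D + I) / N."""
    return (s + d + i) / n


# Recompute the 100%/100% Danish figure from the counts reported above.
danish_full = pher(668, 1326, 280, 280912)
```

In a full evaluation the counts would be summed over every word in the test set before dividing by the total number of reference phonemes N.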

The tool is lexicon-free and relies purely on predictive modelling derived from decision analysis. It is entirely data-driven: the underlying decision tree has been generated automatically from data that pairs orthographic forms with their phonemic counterparts [1].
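
The page does not describe which features the decision tree is trained on. A common setup for this kind of model, assumed here purely for illustration, is to present each letter together with a fixed window of neighbouring letters, so that the tree can learn context-dependent mappings. The helper name `context_windows`, the padding symbol `#`, and the window width are all hypothetical.

```python
from typing import List, Tuple

PAD = "#"  # hypothetical word-boundary symbol


def context_windows(word: str, width: int = 2) -> List[Tuple[str, ...]]:
    """Return one window per letter: `width` letters of left context,
    the focus letter itself, then `width` letters of right context."""
    letters = word.lower()
    padded = PAD * width + letters + PAD * width
    size = 2 * width + 1
    return [tuple(padded[i:i + size]) for i in range(len(letters))]


windows = context_windows("kat", width=1)
# Each window would be paired with the phoneme aligned to its focus letter
# to form one training example for the tree.
```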

The transcription tool is not error-free. For "normal" native words it mostly produces correct results, but it often fails on words of foreign origin, some proper names, abbreviations, and similar cases. Other systems resort to hybrid solutions in which a lexicon of "exceptions" is combined with predictive mapping based on decision trees, neural networks, or similar technologies.
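
The hybrid approach used by other systems can be sketched as a lexicon lookup with a predictive fallback. This is not how the present tool works (it is lexicon-free); the function names, the stand-in predictor, and the single lexicon entry are illustrative assumptions only.

```python
from typing import Callable, Dict


def make_hybrid_transcriber(
    exceptions: Dict[str, str],
    predict: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a transcriber that consults an exception lexicon first and
    falls back to a predictive model for every other word."""

    def transcribe(word: str) -> str:
        key = word.lower()
        if key in exceptions:
            return exceptions[key]  # hand-listed irregular form
        return predict(key)         # model-based prediction

    return transcribe


def spell_out(word: str) -> str:
    """Stand-in for a trained model: emits the letters unchanged."""
    return " ".join(word)


# Illustrative SAMPA-style entry, not taken from the tool's data.
transcribe = make_hybrid_transcriber({"colonel": "k 3: n @ l"}, spell_out)
```

With this arrangement the lexicon only needs to list the words the predictive model is known to get wrong, which keeps it small.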

[1] The data-driven, predictive model is suited only to languages with alphabetic orthographies (where one grapheme largely corresponds to one phone). This excludes languages like Chinese (with a syllable-based orthography) and Hebrew (with a consonantal orthography). Moreover, even among languages with alphabetic orthographies, the problem of mapping graphemic symbols to phonemic ones is not equally complex. There are extremely "easy" languages like Turkish, where the problem can largely be solved simply by substituting orthographic symbols with phonemic ones without considering the context. And there are "difficult" languages like Danish, where certain historical sound changes (weakening of plosives, lowering of vowels in certain contexts, etc.) have resulted in a complex relation between orthography and pronunciation.