Automatic Phonemic Transcriber

Version 1.1. Note: Corpora are regularly improved and the transcribers retrained.

An automated phonetic/phonemic transcriber supporting English, German, and Danish. It outputs transcriptions in either the International Phonetic Alphabet (IPA) or SAMPA, an alphabet designed for speech recognition technology.

Note: The German transcriber has been trained on both word forms in cases where the spelling was changed by the orthography reform of 1996 (e.g. both "muss" and "muß", "Riss" and "Riß", etc.).

GERMAN CORPUS (unique word forms): 22822
Training/test 100%/100%: words correct: 21775, words incorrect: 1047 = 95.41% correct
Phoneme Error Rate (PHER)* = (S+D+I)/N = (190+850+123)/195264 = 0.00596
Training/test 90%/10%: words correct: 1716, words incorrect: 566 = 75.20% correct
Phoneme Error Rate (PHER)* = (S+D+I)/N = (79+663+96)/19512 = 0.04295
Training/test 80%/20%: words correct: 3411, words incorrect: 1153 = 74.74% correct
Phoneme Error Rate (PHER)* = (S+D+I)/N = (201+1282+201)/39011 = 0.04317

*)
S is the number of phoneme substitutions,
D is the number of phoneme deletions,
I is the number of phoneme insertions,
N is the number of phonemes in the reference (test set)
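
The error rate defined above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names align_counts and pher are hypothetical, and the alignment used here is a plain minimum-edit-distance alignment, which may differ from the scoring procedure actually used to produce the figures above.

```python
def align_counts(reference, hypothesis):
    """Return (S, D, I) for one minimum-edit-distance alignment
    of a reference phoneme sequence against a hypothesis."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] = (total edits, S, D, I) for reference[:i] vs hypothesis[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, ins)           # match, no cost
            else:
                diag = (t + 1, s + 1, d, ins)   # substitution
            t, s, d, ins = table[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = table[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Tuples compare by total edits first, so min() keeps a cheapest path.
            table[i][j] = min(diag, deletion, insertion)
    _, s, d, ins = table[n][m]
    return s, d, ins


def pher(pairs):
    """PHER = (S + D + I) / N over (reference, hypothesis) pairs,
    where N is the number of phonemes in the references."""
    s_total = d_total = i_total = n_total = 0
    for reference, hypothesis in pairs:
        s, d, ins = align_counts(reference, hypothesis)
        s_total += s
        d_total += d
        i_total += ins
        n_total += len(reference)
    return (s_total + d_total + i_total) / n_total


# Recomputing the reported German figures from their S, D, I and N values:
print(round((190 + 850 + 123) / 195264, 5))   # 0.00596
print(round((79 + 663 + 96) / 19512, 5))      # 0.04295
print(round((201 + 1282 + 201) / 39011, 5))   # 0.04317
```

Each phoneme sequence is passed as a list or string of symbols; multi-character symbols (as in SAMPA) would need to be tokenised into a list first.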

View performance statistics for Danish, German, English

The tool is lexicon-free and is based purely on predictive modelling derived from decision analysis. It is 100 % data-driven, i.e. the underlying decision tree has been generated automatically from data containing orthographic forms and their phonemic counterparts [1].

The transcription tool is not error-free. For "normal" native words it mostly produces correct results; for words of foreign origin, some proper names, abbreviations, etc., however, it often fails. Other systems resort to hybrid solutions, in which a lexicon of "exceptions" is combined with predictive mapping based on decision trees, neural networks, or similar technologies.
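
The hybrid approach used by other systems can be pictured roughly as follows. This is a minimal sketch, not a description of any particular system: make_hybrid, the single exception entry, and the stand-in predictor are all hypothetical and serve only to show the lookup-then-fallback order.

```python
def make_hybrid(exceptions, predict):
    """Build a transcriber that consults an exception lexicon first
    and falls back to a predictive model for all other words."""
    def transcribe(word):
        if word in exceptions:
            return exceptions[word]
        return predict(word)
    return transcribe


# Hypothetical exception lexicon with one irregular English word.
exceptions = {"colonel": "ˈkɜːnəl"}


def stand_in_predictor(word):
    """Placeholder for a trained model (decision tree, neural network, ...)."""
    return "<predicted:" + word + ">"


transcribe = make_hybrid(exceptions, stand_in_predictor)
print(transcribe("colonel"))   # taken from the lexicon
print(transcribe("cat"))       # handed to the predictor
```

The lexicon-free tool described here corresponds to the predictor alone, without the exception lookup.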

[1] The data-driven, predictive model is suited only to languages with alphabetic orthographies (where one grapheme largely corresponds to one phone). This excludes languages like Chinese (with a syllable-based orthography) and Hebrew (with a consonantal orthography). Moreover, among languages with alphabetic orthographies, the problem of mapping graphemic symbols to phonemic ones is not equally complex. There are extremely "easy" languages like Turkish, where the problem can largely be solved simply by substituting orthographic symbols with phonemic ones without considering the context. And there are "difficult" languages like Danish, where certain historical sound changes (weakening of plosives, lowering of vowels in certain contexts, etc.) have resulted in a complex relation between orthography and pronunciation.
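
The "easy" case mentioned in the footnote, context-free substitution, might look like the sketch below. The table is a deliberately incomplete toy covering only a handful of Turkish letters whose IPA symbol differs from the letter itself; letters not listed (including ğ, which is not handled at all) are passed through unchanged as broad symbols. The names TOY_TURKISH_MAP and transcribe_context_free are assumptions made for illustration, and input is assumed to be lower-case already.

```python
# Toy, incomplete letter-to-IPA table for Turkish (illustration only).
TOY_TURKISH_MAP = {
    "c": "dʒ",
    "ç": "tʃ",
    "ş": "ʃ",
    "j": "ʒ",
    "y": "j",
    "ı": "ɯ",
    "ö": "ø",
    "ü": "y",
}


def transcribe_context_free(word, table=TOY_TURKISH_MAP):
    """Replace each letter independently of its neighbours.
    Letters missing from the table are copied through as they are."""
    return "".join(table.get(letter, letter) for letter in word)


print(transcribe_context_free("çocuk"))   # tʃodʒuk
print(transcribe_context_free("yol"))     # jol
```

For a "difficult" language such as Danish, a per-letter table like this is not enough, because the correct phoneme depends on the surrounding letters; that context is what the decision tree has to learn from data.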