Version 1.1. Note: Corpora are regularly improved and transcribers retrained.
An automated phonetic/phonemic transcriber supporting English, German, and Danish. It outputs transcriptions in the International Phonetic Alphabet (IPA) or in SAMPA, an alphabet designed for speech recognition technology.
Note: The German transcriber has been trained on dual word forms in cases where the spelling was changed by the orthography reform of 1996 (e.g. both "muss" and "muß", "Riss" and "Riß", etc.).
|GERMAN CORPUS (unique word forms):||22822|
|Training/test 100%/100%||words correct: 21775,||words incorrect: 1047||= 95.41 % correct|
|Phoneme Error Rate (PHER)=||(S+D+I)/N *=||(190+850+123)/195264||= 0.00596|
|Training/test 90%/10%||words correct: 1716,||words incorrect: 566||= 75.20 % correct|
|Phoneme Error Rate (PHER)=||(S+D+I)/N *=||(79+663+96)/19512||= 0.04295|
|Training/test 80%/20%||words correct: 3411,||words incorrect: 1153||= 74.74 % correct|
|Phoneme Error Rate (PHER)=||(S+D+I)/N *=||(201+1282+201)/39011||= 0.04317|
* PHER = (S + D + I) / N, where
S is the number of phoneme substitutions,
D is the number of phoneme deletions,
I is the number of phoneme insertions,
N is the number of phonemes in the reference (test set).
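The error-rate arithmetic can be sketched in Python. The function names `phoneme_error_rate` and `count_errors` are hypothetical, and the minimum-edit-distance alignment is an assumption about how S, D and I are counted; the page does not say how the tool's own scoring aligns predicted and reference phonemes. The phoneme sequences in the usage lines are illustrative only.

```python
def phoneme_error_rate(substitutions, deletions, insertions, reference_length):
    """PHER = (S + D + I) / N, with N the number of reference phonemes."""
    return (substitutions + deletions + insertions) / reference_length


def count_errors(reference, hypothesis):
    """Return (S, D, I) for one word from a minimum-edit-distance alignment.

    Assumption: each error costs 1. When several alignments have the same
    total cost, the split between S, D and I may differ; this sketch picks
    one of them deterministically.
    """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # table[i][j] = (total, S, D, I) for reference[:i] versus hypothesis[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            total, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, s, d, ins)          # match, no cost
            else:
                diagonal = (total + 1, s + 1, d, ins)  # substitution
            total, s, d, ins = table[i - 1][j]
            deletion = (total + 1, s, d + 1, ins)
            total, s, d, ins = table[i][j - 1]
            insertion = (total + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = table[-1][-1]
    return s, d, ins


# Reproduce the 100%/100% row of the table from its published counts.
print(round(phoneme_error_rate(190, 850, 123, 195264), 5))  # 0.00596

# Illustrative word: the final phoneme is missing from the hypothesis.
print(count_errors(["r", "ɪ", "s"], ["r", "ɪ"]))  # (0, 1, 0)
```

Summing the per-word counts over a test set and dividing by the total number of reference phonemes gives the corpus-level figure reported in the table.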
The tool is lexicon-free and is based purely on predictive modelling derived from decision analysis. It is 100% data-driven, i.e. the underlying decision tree has been generated automatically from data containing orthographic forms and their phonemic counterparts.
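To make the idea of a tree generated from orthographic/phonemic pairs concrete, here is a toy sketch. It is not the tool's algorithm: the features (previous, current and next letter), the one-letter-to-one-phoneme alignment, the split criterion (fewest misclassified examples) and the four training words are all assumptions chosen for illustration, and every function name is hypothetical.

```python
from collections import Counter

# Toy training data (assumed): each letter is aligned with exactly one phoneme.
TRAINING = [
    ("cat", ["k", "æ", "t"]),
    ("city", ["s", "ɪ", "t", "i"]),
    ("cot", ["k", "ɒ", "t"]),
    ("cent", ["s", "ɛ", "n", "t"]),
]


def features(word, i):
    """Context of letter i: (previous, current, next), '#' at word edges."""
    padded = "#" + word + "#"
    return (padded[i], padded[i + 1], padded[i + 2])


def build_tree(examples, unused=(1, 0, 2)):
    """Greedy tree induction; a leaf is a phoneme, a node is a tuple."""
    labels = Counter(label for _, label in examples)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not unused:
        return majority

    def errors(feature):
        # Examples left misclassified if we split on this feature and
        # answer with the majority phoneme in each branch.
        groups = {}
        for feats, label in examples:
            groups.setdefault(feats[feature], Counter())[label] += 1
        return sum(sum(c.values()) - max(c.values()) for c in groups.values())

    best = min(unused, key=errors)
    remaining = tuple(f for f in unused if f != best)
    branches = {}
    for feats, label in examples:
        branches.setdefault(feats[best], []).append((feats, label))
    subtrees = {value: build_tree(subset, remaining)
                for value, subset in branches.items()}
    return (best, majority, subtrees)


def predict(tree, feats):
    """Walk the tree; fall back to the node's majority for unseen values."""
    while isinstance(tree, tuple):
        feature, majority, branches = tree
        tree = branches.get(feats[feature], majority)
    return tree


def transcribe(tree, word):
    return "".join(predict(tree, features(word, i)) for i in range(len(word)))


examples = [(features(word, i), phoneme)
            for word, phonemes in TRAINING
            for i, phoneme in enumerate(phonemes)]
tree = build_tree(examples)
print(transcribe(tree, "cit"))  # sɪt
```

In this toy data the letter "c" is resolved by the following letter ("k" before "a" and "o", "s" before "e" and "i"), which is the kind of context-dependent rule such a tree can pick up without a lexicon.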
The transcription tool is not error-free. For "normal" native words it mostly produces correct results, but for words of foreign origin, some proper names, abbreviations, etc. it often fails. Other systems resort to hybrid solutions where a lexicon of "exceptions" is combined with predictive mapping based on decision trees, neural networks, or similar technologies.
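The hybrid approach used by other systems can be sketched as a lexicon lookup with a predictive fallback. This is a minimal illustration, not any particular system's design: `transcribe_hybrid`, the single lexicon entry and the placeholder predictor are all hypothetical.

```python
def transcribe_hybrid(word, exceptions, predict):
    """Use the exception lexicon if the word is listed, else the predictor."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return predict(key)


# Hypothetical exception lexicon with one irregular English word.
EXCEPTIONS = {"colonel": "ˈkɜːnəl"}


def placeholder_predict(word):
    """Stand-in for a trained model; it simply echoes the spelling."""
    return word


print(transcribe_hybrid("Colonel", EXCEPTIONS, placeholder_predict))  # ˈkɜːnəl
print(transcribe_hybrid("cat", EXCEPTIONS, placeholder_predict))      # cat
```

The trade-off is that the lexicon has to be compiled and maintained by hand, whereas a purely predictive tool needs only training data.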
The data-driven, predictive model is suited only to languages with alphabetic orthographies (where one grapheme largely corresponds to one phone). This excludes languages like Chinese (with a syllable-based orthography) and Hebrew (with a consonantal orthography). Moreover, among languages with alphabetic orthographies, the problem of mapping graphemic symbols to phonemic ones is not equally complex. There are extremely "easy" languages like Turkish, where the problem can largely be solved simply by substituting orthographic symbols with phonemic ones without considering the context. And there are "difficult" languages like Danish, where certain historical sound changes (weakening of plosives, lowering of vowels in certain contexts, etc.) have resulted in a complex relation between orthography and pronunciation.
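Context-free substitution of the kind described for Turkish amounts to a per-letter table lookup. The sketch below uses a deliberately simplified, partial table that is an assumption for illustration, not data from the tool: letters not listed are passed through unchanged, "ğ" is left out because it does not map to a single sound, and input is assumed to be lowercase already.

```python
# Simplified, partial letter-to-IPA table (assumed for illustration only).
TURKISH_TO_IPA = {
    "c": "dʒ",
    "ç": "tʃ",
    "ş": "ʃ",
    "j": "ʒ",
    "ı": "ɯ",
    "ö": "ø",
    "ü": "y",
    "y": "j",
}


def substitute(word, table=TURKISH_TO_IPA):
    """Map each letter independently, ignoring its neighbours."""
    return "".join(table.get(letter, letter) for letter in word)


print(substitute("çocuk"))  # tʃodʒuk
print(substitute("şeker"))  # ʃeker
```

For a language like Danish no such table is adequate, because the same letter is pronounced differently depending on its neighbours; that is the case where a context-sensitive model is needed.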