Automatic Phonemic Transcriber

Danish Version 1.3. New: in SAMPA, non-syllabic vowels are now marked with the postfixed diacritic '^' (u^, i^, 6^). The transcriber now distinguishes between the full vowel Q and the unstressed central vowel 6. Performance has been improved. Support for exceptions.

This free version allows only 40 word transcriptions per submission. Log in or obtain a license to get unrestricted access.


An automated phonetic/phonemic transcriber supporting English, German, and Danish. It outputs transcriptions in the International Phonetic Alphabet (IPA) or in SAMPA, an alphabet designed for speech recognition technology.

View performance statistics for Danish, German, English

The transcription tool is based on a decision tree derived from a training lexicon (a list of orthographic forms and their phonemic counterparts). It does not look up words in a lexicon; instead, it transcribes according to the general rules it "learned" from the training lexicon. More specifically, the decision tree is machine-generated code that decides how a grapheme should be transcribed phonemically given its left and right context. It was generated by a program that first aligns the graphemes and phonemes of the training lexicon using an Expectation-Maximization algorithm and then builds the tree structure from those alignments [1].
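
To make the idea of context-dependent grapheme decisions concrete, here is a minimal sketch in Python. It is not the tool's generated code: the tree layout, the names (Leaf, Node, TREES, transcribe) and the toy rules for the English letters "c" and "h" are all assumptions made for illustration, and the trees are written by hand rather than learned from a lexicon. Letters without a tree are simply copied to the output.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    phoneme: str  # SAMPA-like output symbol ("" means the grapheme is silent)


@dataclass
class Node:
    offset: int         # which context letter to test: -1 = left, +1 = right
    letters: frozenset  # the test succeeds if that letter is in this set
    yes: "Tree"
    no: "Tree"


Tree = Union[Leaf, Node]

# Hypothetical, hand-written trees; a real system would generate them
# from grapheme-phoneme alignments of a training lexicon.
TREES = {
    # "c" before e/i/y -> s, before h -> tS, otherwise k
    "c": Node(+1, frozenset("eiy"), Leaf("s"),
              Node(+1, frozenset("h"), Leaf("tS"), Leaf("k"))),
    # "h" right after "c" was already covered by the "c" decision
    "h": Node(-1, frozenset("c"), Leaf(""), Leaf("h")),
}


def context_letter(word: str, i: int, offset: int) -> str:
    """Return the letter at i + offset, or '#' beyond the word boundary."""
    j = i + offset
    return word[j] if 0 <= j < len(word) else "#"


def classify(tree: Tree, word: str, i: int) -> str:
    """Walk the tree for the grapheme at position i until a leaf is reached."""
    while isinstance(tree, Node):
        hit = context_letter(word, i, tree.offset) in tree.letters
        tree = tree.yes if hit else tree.no
    return tree.phoneme


def transcribe(word: str) -> str:
    """Transcribe grapheme by grapheme; unknown graphemes are copied as-is."""
    return "".join(
        classify(TREES.get(g, Leaf(g)), word, i) for i, g in enumerate(word)
    )


print(transcribe("cat"), transcribe("city"), transcribe("chip"))
```

Each grapheme is decided independently from its neighbours, so no word is ever looked up as a whole. This is also why irregular words can come out wrong unless they are handled separately.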

The transcription tool is not error-free. For "normal" native words it mostly produces correct results, but for words of foreign origin, some proper names, abbreviations, etc. it often fails. Other systems may be mainly lexicon-based and resort to machine-generated transcriptions only when a word is not found in the lexicon. Since version 1.2, the present system has used exception lists: small "lexica" of words (typically of foreign origin) that cannot be transcribed properly even when they are included in the training lexicon.

[1] The data-driven, predictive model is suited only to languages with alphabetic orthographies (where one grapheme largely corresponds to one phoneme). This excludes languages like Chinese (with a syllable-based orthography) and Hebrew (with a consonantal orthography). Moreover, even among languages with alphabetic orthographies, the problem of mapping graphemic symbols to phonemic ones is not equally complex. There are extremely "easy" languages like Turkish, where the problem can largely be solved by substituting orthographic symbols with phonemic ones without considering the context. And there are "difficult" languages like Danish, where certain historical sound changes (weakening of plosives, lowering of vowels in certain contexts, etc.) have resulted in a complex relation between orthography and pronunciation.
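
As a rough illustration of the "easy" case, the following Python sketch transcribes Turkish by plain symbol substitution, with no context at all. The table and the function name transcribe_turkish are assumptions for illustration: the mapping is a broad IPA approximation, it covers only letters whose IPA symbol differs from the spelling, and it expects lowercase input. The letter ğ is deliberately left out, because its realization depends on the surrounding vowels, which is one reason the substitution approach works only "largely".

```python
# Broad, context-free mapping from Turkish letters to IPA symbols.
# Letters not listed here are assumed to map to themselves.
TURKISH_TO_IPA = {
    "c": "dʒ",
    "ç": "tʃ",
    "ş": "ʃ",
    "j": "ʒ",
    "y": "j",
    "ı": "ɯ",
    "ö": "ø",
    "ü": "y",
}


def transcribe_turkish(word: str) -> str:
    """Replace each letter independently; assumes the word is lowercase."""
    return "".join(TURKISH_TO_IPA.get(ch, ch) for ch in word)


print(transcribe_turkish("çocuk"))  # child
print(transcribe_turkish("şey"))    # thing
```

A comparable table for Danish would fail on most words, since the same letter is often pronounced differently depending on its neighbours.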