Sunday, June 10, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Week 2

[It is my fault this update is late, not Ronanki's. --James Salsman]

Following last week's discussion describing how to obtain phoneme acoustic scores from sphinx3_align, here is some additional detail pertaining to two of the necessary output arguments:

1. Following up on the discussion at https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4583225, I was able to produce acoustic scores for each frame, and therefore for each phoneme, in a single recognition pass.  Add the following code to the write_stseg function in main_align.c and pass the state segmentation parameter -stsegdir as an argument to the program:

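    /* Write one line per frame of the state segmentation.
       i, fp, stseg and kbc are assumed to be in scope inside write_stseg. */
    char str2[1024];        /* buffer for the readable phone string */
    align_stseg_t *tmp1;    /* cursor over the linked list of frames */

    for (i = 0, tmp1 = stseg; tmp1; i++, tmp1 = tmp1->next) {
        /* Convert the phone ID into its string form, e.g. "AH SIL P b". */
        mdef_phone_str(kbc->mdef, tmp1->pid, str2);
        fprintf(fp, "FrameIndex %d Phone %s PhoneID %d SenoneID %d state %d Ascr %11d \n",
            i, str2, tmp1->pid, tmp1->sen, tmp1->state, tmp1->score);
    }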
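As a rough illustration of how the per-frame lines could be turned into per-phoneme scores, here is a minimal Python sketch. It assumes the line layout of the fprintf format string above and that a phoneme's score is the sum of its frame scores; both are assumptions, and the function names and sample values are invented for the example.

```python
import re
from itertools import groupby

# Assumed line layout, mirroring the fprintf format string in the patch.
LINE_RE = re.compile(
    r"FrameIndex\s+(\d+)\s+Phone\s+(.+?)\s+PhoneID\s+(-?\d+)\s+"
    r"SenoneID\s+(-?\d+)\s+state\s+(-?\d+)\s+Ascr\s+(-?\d+)"
)

def parse_frames(lines):
    """Return (frame, phone, phone_id, score) tuples for the lines that match."""
    frames = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            frames.append((int(m.group(1)), m.group(2),
                           int(m.group(3)), int(m.group(6))))
    return frames

def phone_scores(frames):
    """Sum frame scores over each run of consecutive frames with the same phone.

    Limitation: two adjacent occurrences of the same phone ID are merged
    into a single run by this simple grouping.
    """
    result = []
    for _key, run in groupby(frames, key=lambda f: (f[2], f[1])):
        run = list(run)
        result.append((run[0][1], run[0][0], run[-1][0],
                       sum(f[3] for f in run)))
    return result

# Invented sample lines, only to show the expected shape of the input.
SAMPLE = [
    "FrameIndex 0 Phone SIL PhoneID 1 SenoneID 10 state 0 Ascr        -100 ",
    "FrameIndex 1 Phone SIL PhoneID 1 SenoneID 11 state 1 Ascr        -200 ",
    "FrameIndex 2 Phone AH SIL P b PhoneID 57 SenoneID 300 state 0 Ascr         -50 ",
]

for phone, first, last, score in phone_scores(parse_frames(SAMPLE)):
    print(phone, first, last, score)
```

Each result tuple holds the phone string, its first and last frame index, and the summed score.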

2. By using the phone segmentation parameter -phsegdir as an argument to the program, the acoustic scores for each phoneme can be calculated. The output sequence for the word "approach" is as follows:

         SFrm  EFrm   SegAScr       Phone
            0     9    -64725       SIL
           10    21    -63864       AH SIL P b
           22    33   -126819       P AH R i
           34    39    -21470       R P OW i
           40    51    -69577       OW R CH i
           52    64    -55937       CH OW DH e

Each phoneme in the "Phone" column is represented as <Aligned_phone> <Previous_phone> <Next_phone> <position_in_the_word (b-begin, i-middle, e-end)>.  The full command line usage for this output is:

$ sphinx3_align -hmm wsj_all_cd30.mllt_cd_cont_4000 -dict cmu.dic -fdict phone.filler -ctl phone.ctl -insent phone.insent -cepdir feats -phsegdir phonesegdir -phlabdir phonelabdir -stsegdir statesegdir -wdsegdir aligndir -outsent phone.outsent
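The phone segmentation table shown above is easy to read back into a program. The following Python sketch parses rows in that column layout (SFrm, EFrm, SegAScr, Phone); the function name and the dictionary keys are illustrative choices, not part of sphinx3.

```python
def parse_phseg(text):
    """Parse phone segmentation rows: start frame, end frame, score, phone label."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that does not start with two frame numbers.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        start, end, score = int(fields[0]), int(fields[1]), int(fields[2])
        label = fields[3:]
        rows.append({
            "start": start,
            "end": end,
            "score": score,
            "phone": label[0],        # the aligned phone
            "context": label[1:],     # previous phone, next phone, word position
            "frames": end - start + 1,
        })
    return rows

# The "approach" example from this post.
PHSEG = """
 SFrm  EFrm   SegAScr       Phone
    0     9    -64725       SIL
   10    21    -63864       AH SIL P b
   22    33   -126819       P AH R i
   34    39    -21470       R P OW i
   40    51    -69577       OW R CH i
   52    64    -55937       CH OW DH e
"""

for row in parse_phseg(PHSEG):
    # Dividing by the frame count makes scores of different durations comparable.
    print(row["phone"], row["frames"], row["score"] / row["frames"])
```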

Work in progress:

1. It's very important to weight word scores by each word's part of speech: omitted articles don't matter very much, while nouns, adjectives, verbs, and adverbs are the most important. Troy has designed a basic database schema at http://talknicer.net/w/Database_schema in which the part of speech is one of the fields in the "prompts" table, and the acoustic and duration standard scores are stored in the "scores" table.
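To make the weighting idea concrete, here is a small Python sketch of a part-of-speech weighted average. The weight values, the default weight, and the function name are all hypothetical placeholders; the project has not fixed any of them, and the scores in the example are invented.

```python
# Hypothetical weights: content words count fully, articles count much less.
POS_WEIGHTS = {
    "noun": 1.0,
    "verb": 1.0,
    "adjective": 1.0,
    "adverb": 1.0,
    "article": 0.25,
}
DEFAULT_WEIGHT = 0.5  # assumed weight for any other part of speech

def weighted_phrase_score(word_scores):
    """Combine (word, part_of_speech, score) triples into one weighted mean."""
    total = 0.0
    weight_sum = 0.0
    for _word, pos, score in word_scores:
        weight = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no words to score")
    return total / weight_sum

# Invented scores: the omitted article ("the") scores 0 but barely lowers the result.
example = [("approach", "verb", 0.8), ("the", "article", 0.0), ("teaching", "noun", 0.6)]
print(weighted_phrase_score(example))
```

With these placeholder weights, a missing article pulls the phrase score down far less than it would in an unweighted mean.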

2. I put some exemplar recordings that the project mentor had collected for three phrases at http://talknicer.net/~ronanki/Datasets/, with one subdirectory for each phrase.  The phrases are described at http://talknicer.net/~ronanki/Datasets/files/phrases.txt.

3. I ran sphinx3_align on that sample data set. I wrote a program to calculate the mean and standard deviation of the acoustic scores for each phoneme, as well as the mean duration of each phoneme. I also generated neighbor phonemes for each of the phrases; the output is written to this file: http://talknicer.net/~ronanki/Datasets/out_ngb_phonemes.insent
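The statistics step can be sketched in a few lines of Python. This is not the program mentioned above, only a minimal illustration of the computation. It assumes the input is a list of (phone, acoustic score, duration in frames) observations and uses the population standard deviation. The sample numbers are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev

def phoneme_stats(observations):
    """Per-phoneme mean/std of acoustic scores and mean duration in frames."""
    scores = defaultdict(list)
    durations = defaultdict(list)
    for phone, score, frames in observations:
        scores[phone].append(score)
        durations[phone].append(frames)
    return {
        phone: {
            "mean_score": mean(values),
            "std_score": pstdev(values),   # population standard deviation
            "mean_duration": mean(durations[phone]),
        }
        for phone, values in scores.items()
    }

def standard_score(value, mean_value, std_value):
    """Standard (z) score; returns 0.0 when the spread is zero."""
    if std_value == 0:
        return 0.0
    return (value - mean_value) / std_value

# Invented observations, only to show the shape of the data.
obs = [("AH", -100, 10), ("AH", -200, 14), ("P", -50, 6)]
stats = phoneme_stats(obs)
print(stats["AH"])
print(standard_score(-100, stats["AH"]["mean_score"], stats["AH"]["std_score"]))
```

A learner's phoneme score can then be compared with the exemplar statistics through `standard_score`.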

4. I also tried some of the other sphinx3 executables, such as sphinx3_decode, sphinx3_livepretend, and sphinx3_continuous, for mispronunciation detection. For the sentence "Approach the teaching of pronunciation with more confidence." (phrase 1), I used this command:

$ SPHINX3DECODE -hmm ${WSJ} -fsg phone.fsg -dict basicphone.dic -fdict phone.filler -ctl new_phone.ctl -hyp phone.out -cepdir feats -mode allphone -hypseg phone_hypseg.out -op_mode 2

The decoder, sphinx3_decode, produced this output:

P UH JH DH CH IY CH Y N Z Y EY SH AH W Z AO K AA F AH N Z

The forced alignment system, sphinx3_align, produced this output: 

AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M AO R K AA N F AH D AH N S

The sphinx3_livepretend and sphinx3_continuous commands produce word-level output, using language models and acoustic models along with a complete dictionary of expected words:

approach to teaching opponents the nation with more confidence

Plans for the coming week:

1. Write and test audio upload and pronunciation evaluation for per-phoneme standard scores.

2. Since there are many deletions in the edit distance scoring grammars tried so far, we need to modify the grammar file and/or the method we are using to detect whether neighboring phonemes match more closely. Here is my idea for finding neighboring phonemes by dynamic programming:

a. Run the decoder to get the best possible output

b. Align the decoder output to forced-alignment output using a dynamic programming string matching algorithm 

c. The aligned output will have the same number of phones as the forced alignment. So, we need to test two things for each phoneme:
  • If the phone is the same as the expected phoneme, there is no need to do anything.
  • If the phone is not the expected phoneme, check whether it is in the list of neighboring phonemes of the expected phoneme.

d. Then, we can run sphinx3_align with this outcome against the same wav file to check whether the acoustic scores actually indicate a better match. 
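Steps (b) and (c) can be sketched with a standard edit-distance alignment. The Python below is a minimal illustration under stated assumptions: unit costs for substitution, insertion, and deletion, decoder phones with no counterpart in the forced alignment are dropped, and the neighbor table is a hypothetical placeholder rather than the project's actual neighbor lists.

```python
def align(ref, hyp):
    """Align hyp to ref by edit distance; return one entry per ref phone.

    Entries are the matched hyp phone, or None where the ref phone was deleted.
    Extra hyp phones (insertions) are dropped from the result.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the alignment.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            aligned[i - 1] = hyp[j - 1]   # match or substitution
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1                        # deletion: ref phone has no counterpart
        else:
            j -= 1                        # insertion: extra hyp phone is dropped
    return aligned, cost[n][m]

def classify(ref, aligned, neighbors):
    """Label each expected phone as match, neighbor, mismatch, or deleted."""
    labels = []
    for expected, observed in zip(ref, aligned):
        if observed is None:
            labels.append("deleted")
        elif observed == expected:
            labels.append("match")
        elif observed in neighbors.get(expected, ()):
            labels.append("neighbor")
        else:
            labels.append("mismatch")
    return labels

# Forced-alignment and decoder outputs quoted earlier in this post.
REF = ("AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N "
       "W IH TH M AO R K AA N F AH D AH N S").split()
HYP = "P UH JH DH CH IY CH Y N Z Y EY SH AH W Z AO K AA F AH N Z".split()
NEIGHBORS = {"S": {"Z"}, "AH": {"UH"}}  # hypothetical neighbor lists

aligned, distance = align(REF, HYP)
for expected, observed, label in zip(REF, aligned, classify(REF, aligned, NEIGHBORS)):
    print(expected, observed, label)
```

The aligned list always has one entry per forced-alignment phone, which is the property step (c) relies on.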

3. As an alternative to the above, I used sox to split each input wave file into individual phoneme wav files using the forced alignment phone labels, and then ran a separate recognition pass on each tiny speech segment. Now, I am writing separate grammar files containing the neighboring phonemes for each phoneme. Once I complete them, I will check the output of the decoder for each phoneme segment. This should provide a more accurate assessment of mispronunciations.
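The splitting step could look roughly like the Python sketch below, which only builds sox command lines and does not run them. It assumes 100 frames per second (a 10 ms frame shift) and uses the sox trim effect with a start time and a length in seconds. The frame rate, the file naming scheme, and the function name are assumptions made for the example.

```python
FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift in the front end

def sox_trim_commands(wav_path, segments, out_prefix):
    """Build one 'sox in.wav out.wav trim <start> <length>' command per segment.

    segments holds (start_frame, end_frame, phone) with inclusive frame numbers.
    """
    commands = []
    for index, (start, end, phone) in enumerate(segments):
        start_s = start / FRAMES_PER_SECOND
        length_s = (end - start + 1) / FRAMES_PER_SECOND
        out_path = f"{out_prefix}_{index:03d}_{phone}.wav"
        commands.append(["sox", wav_path, out_path, "trim",
                         f"{start_s:.2f}", f"{length_s:.2f}"])
    return commands

# First segments of the "approach" example; the wav file name is hypothetical.
segments = [(0, 9, "SIL"), (10, 21, "AH"), (22, 33, "P")]
for command in sox_trim_commands("approach.wav", segments, "seg"):
    print(" ".join(command))
```

Each command list can be passed to `subprocess.run` once the frame rate has been checked against the feature extraction settings.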

4. I will update the wiki here at http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation with my current tasks and milestones.
