Thursday, May 2, 2013

Data collection funded; moving GSoC to Moodle

Two big news items: we have a sponsor for the data collection effort, and in the 2013 Google Summer of Code, we're going to try to integrate with Moodle. More soon.

Saturday, November 17, 2012

UPDATED: Status update

I wish that this were a progress report instead of a status update, but so far we haven't raised enough to begin data collection with Mechanical Turk. We have had a paper accepted for publication, and we are trying to get into the Google Compute Engine to avoid a huge Amazon bill for asking people who claim to have good pronunciation and reading skill to record exemplars. The problem is that the number of such exemplars needs to be relatively large. For those of you familiar with the TalkNicer demo, this is the "exemplar sufficiency index," and it needs to meet a certain threshold for at least 5,000 words of instructional material before I feel comfortable committing to an expensive data collection effort.

So in summary, please donate more, or if you have already donated, please ask multiple people to at least match your donation. It will be worth it.

Update: How much more do we need? About $4,000 based on the preliminary per-phoneme exemplar sufficiency index including English homographs and Mechanical Turk performance expectation estimates. Also updated: cmusphinx.sourceforge.net/wiki/pronunciation_evaluation

Further update: I am very sorry about delaying Troy's posts here (the delay was due to the WebRTC and related questions), but they have been available at e.g. cmusphinx.sourceforge.net/2012/08/gsoc-2012-pronunciation-evaluation-troy-project-conclusions

Sunday, August 26, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation: Summary and Conclusions

This article briefly summarizes the implementation of the GSoC 2012 Pronunciation Evaluation project.

I started with Sphinx forced alignment and obtained spectral-matching acoustic scores and durations at the phone and word levels using the WSJ models. After that, I concentrated mainly on two things as part of the GSoC 2012 project: edit-distance neighbor phone decoding, and scoring routines for both the text-dependent and text-independent systems.

Edit-distance neighbor phone decoding:

1. I started with a single-phone decoder and then explored a three-phone decoder, a word decoder, and a complete-phrase decoder, providing neighbor phones as alternates to the expected phone.
2. The decoding results showed that word-level and phrase-level decoding using JSGF grammars produce almost the same results.
3. This method helps detect mispronunciations at the phone level, and it can also detect homographs if the decoding error rate can be reduced.
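The alternation idea above can be sketched as grammar construction: for each expected phone, offer its edit-distance neighbors as alternates and let the decoder pick. This is a minimal illustration; the neighbor map below is invented for the example and is not the project's actual phone confusion table.

```python
# Sketch: build a JSGF grammar offering neighbor phones as alternates
# to each expected phone. NEIGHBORS is a hypothetical illustration.
NEIGHBORS = {
    "AH": ["AA", "ER"],
    "S":  ["Z", "SH"],
    "T":  ["D", "CH"],
}

def jsgf_for_phones(phones):
    """Return a JSGF grammar string with one alternation rule per phone."""
    rules = []
    for i, p in enumerate(phones):
        alts = " | ".join([p] + NEIGHBORS.get(p, []))
        rules.append("<p%d> = %s;" % (i, alts))
    body = " ".join("<p%d>" % i for i in range(len(phones)))
    rules.append("public <phrase> = %s;" % body)
    return "#JSGF V1.0;\ngrammar neighbors;\n" + "\n".join(rules)

print(jsgf_for_phones(["S", "AH", "T"]))
```

If the decoder picks an alternate instead of the expected phone, that phone is flagged as a likely mispronunciation.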

Scoring Routines:

Text-dependent: 
This method is based on exemplars for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Given a test recording, each phone in the phrase is then compared with the exemplar statistics: z-scores are calculated, and normalized scores are derived using the maximum and minimum z-scores observed among the exemplar recordings. All phone scores are aggregated into a word score, and the word scores are aggregated with part-of-speech (POS) weights to produce the complete phrase score.
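The scoring pipeline above can be sketched in a few lines. This is a simplified illustration of the z-score, normalization, and aggregation steps, not the project's actual scoring code; the POS-weighted mean is one plausible reading of "aggregated with POS weight."

```python
import statistics

def phone_z(score, exemplar_scores):
    """z-score of a test phone's acoustic score against exemplar statistics."""
    mu = statistics.mean(exemplar_scores)
    sigma = statistics.stdev(exemplar_scores)
    return (score - mu) / sigma

def normalize(z, z_min, z_max):
    """Map a z-score into [0, 1] using the exemplar z-score range."""
    z = max(min(z, z_max), z_min)  # clamp to the exemplar range
    return (z - z_min) / (z_max - z_min)

def word_score(phone_scores):
    """Word score: mean of the normalized phone scores."""
    return sum(phone_scores) / len(phone_scores)

def phrase_score(word_scores, pos_weights):
    """Phrase score: POS-weighted mean of the word scores."""
    total = sum(pos_weights)
    return sum(w * s for w, s in zip(pos_weights, word_scores)) / total
```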

Text-independent:
This method is based on predetermined statistics built from a corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (begin/middle/end) in the word. Given any test file, each phone's acoustic score and duration are compared with the corresponding phone statistics, selected using this contextual information. The scoring method is the same as that of the text-dependent system.
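The positional lookup can be sketched as a table keyed by (phone, position). The numbers below are invented placeholders for illustration, not real TIMIT statistics.

```python
# Hypothetical per-position duration statistics; real values would be
# estimated from TIMIT alignments.
STATS = {
    ("AH", "begin"):  {"mean_dur": 0.062, "std_dur": 0.018},
    ("AH", "middle"): {"mean_dur": 0.055, "std_dur": 0.015},
    ("AH", "end"):    {"mean_dur": 0.080, "std_dur": 0.025},
}

def duration_z(phone, position, duration):
    """Compare an observed phone duration with its positional statistics."""
    s = STATS[(phone, position)]
    return (duration - s["mean_dur"]) / s["std_dur"]
```

The same lookup would be repeated for acoustic scores, after which the z-scores feed into the same normalization and aggregation as the text-dependent system.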

Demo:
Please try our demo at http://talknicer.net/~ronanki/test/ and help us by giving feedback.

Documentation and codes:
All code is uploaded to the CMU Sphinx SVN at
http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/ and raw documentation of the project can be found here.

Conclusions:
The pronunciation evaluation system really helps users improve their pronunciation: they can try multiple times and correct themselves using the feedback given at the phone and word levels. I couldn't complete some of the things I mentioned earlier during the project, but I hope to keep contributing to this project in the future.

This summer has been a great experience for me, and Google Summer of Code 2012 has finally ended. I would like to thank my mentor James Salsman for his time, continuous efforts, and help; the way he motivated me really helped me stay focused on the project. I would also like to thank my friend Troy Lee, as well as Nickolay and Bhiksha Raj, for their help and comments during the project.

Wednesday, August 22, 2012

GSoC 2012: Troy: Pronunciation Evaluation Week 7 Status


Last week, I was still working on the data collection website.

Thanks so much to Robert (butler1970@gmail.com) for trying out the website and listing the issues he encountered on this page: https://www.evernote.com/pub/butler1970/cmusphinx#b=11634bf8-7be9-479f-a20e-6fa1e54b322b&n=398dc728-b3f0-4ceb-8ccf-89295b98a6d7

Issue #1: The student page is under construction

The first stage of the website is to collect exemplar recordings, so the student page had not been implemented at that time.

Issue #2: The inconvenient birthdate control

The birthdate control has now been replaced with the standard HTML5 <input type="datetime"> control. Because the datetime input control is a new element in HTML5, currently only Chrome, Safari, and Opera support the popup date selection. On other browsers, which have no support yet, the control is simply displayed as an input box; the user can type in the date, and the background script checks whether the format is correct.
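The server-side format check for browsers without a date picker can be sketched as follows. This is an illustration rather than the site's actual PHP code, and the expected date format string is an assumption.

```python
from datetime import datetime

def valid_birthdate(text, fmt="%Y-%m-%d"):
    """Check a typed-in birthdate when the browser lacks a popup
    date picker. The fmt default is an assumed format, not the
    site's confirmed one."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False
```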

Issue #3: The incorrect error message "Invalid date format" on the additional information update page

After digging into the source code for several hours, I found that the bug lies in the order of invoking the MySQL-related functions. The processing steps on the additional information update page are as follows:
a) the client side posts the user's input to the server;
b) the server first runs the mysql_escape_string function on the user's input to keep the later MySQL queries secure;
c) the server checks the format of each field, including whether the user entered a valid date;
d) the server updates the MySQL database with the new information.
Since the MySQL server only appears to be needed in step d), I had put the database connection code after step c), without knowing that the mysql_escape_string function also requires a database connection. In the previous implementation, mysql_escape_string therefore returned an empty string, which led to the "invalid date format" error.

Secondly, the exemplar recording page was updated with the following features:
1) It automatically moves to the next utterance after the user records and plays back the current recording;
2) Extra navigation controls were added for recording phrase selection;
3) When the user opens the exemplar recording page, the first un-recorded utterance is the first one shown to the user;
4) The enabled/disabled state of the player's recording and playback buttons is connected to the database, i.e., if the user has recorded the phrase before, both the recording and playback buttons are enabled; otherwise only recording is allowed.

The third major part done last week is the student page, which was previously left empty.
On the student page, users can now practice their pronunciation by recording the phrases in the database and listening to the exemplar recordings in the system. The features are:
1) Full recording and playback functionality, as on the exemplar recording page;
2) When navigating to each phrase, up to 5 exemplar recordings are randomly retrieved from the database and listed on the page to help the students;
3) Additionally, to seed the system with some exemplar recordings, I had to manually transcribe several sentences and put the recordings into the system. Once many people are contributing exemplar recordings, I won't need to do manual transcription anymore.
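The random selection of up to 5 exemplars per phrase can be sketched like this; the function name and data shape are illustrative, not the site's actual code (which would do this in a database query).

```python
import random

def pick_exemplars(recordings, k=5):
    """Return up to k randomly chosen exemplar recordings for a phrase.

    If fewer than k exemplars exist, all of them are returned.
    """
    return random.sample(recordings, min(k, len(recordings)))
```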

For this week, two major tasks remain: integration with Ronanki's evaluation scripts, and the mid-term report.

Tuesday, August 21, 2012

Ronanki: GSoC 2012 Pronunciation Evaluation Final week Report

Here comes my final report for the Pronunciation Evaluation project. The demo system has been modified a little. You can give it a try and test the text-independent system at http://talknicer.net/~ronanki/test

Last week, I tested the system with both Indian-accented and US-accented speech. For the US accent, I don't have any mispronunciation data, so I just tested with the SA1 and SA2 (TIMIT) sentences. For the Indian accent, I prepared data with both correct pronunciations and mispronunciations, which can be downloaded at http://talknicer.net/~ronanki/Database.tar.tgz

The results are provided at http://talknicer.net/~ronanki/results/. The scripts for evaluating the database are uploaded to the SVN project folder. Phonological features are also provided in SVN, but I couldn't build models with them in time.

The project and the required scripts can be downloaded from
http://sourceforge.net/p/cmusphinx/code/HEAD/tree/branches/speecheval/ronanki/
Please go through README files provided in each folder.

Finally, I would like to thank my mentor James Salsman, as well as Nickolay, Bhiksha Raj, and the rest of the community for helping me all the time. I hope to keep contributing to this project over time.

Ronanki: GSoC 2012 Pronunciation Evaluation week 12

This week, I tried to extend the TIMIT statistics to 5 or 6 per phoneme based on syllable position, or alternatively to do CART modelling to predict duration and acoustic score from training data. I did this to some extent using wagon from the Edinburgh Speech Tools.

Regarding mispronunciation detection accuracy, I collected data from 8 non-native speakers, with 5 words each recorded 10 times in both correct and incorrect ways, and 5 sentences each recorded 3-5 times in both correct and incorrect ways. Here is the link to it: http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/ and the description of the database is at http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/description.txt

I need to split each speaker's data into individual files, which is a tedious task and is taking some time. So far I have completed one speaker's data, and the current text-independent system is doing well: with a common threshold for all words, 46 out of 50 correctly pronounced words are detected as good pronunciations, and 42 out of 50 incorrectly pronounced words are detected as mispronunciations. It will take one or two more days to produce complete statistics.
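For reference, the reported figures work out as follows (a simple check of the arithmetic, using the numbers from this post):

```python
# 46/50 correctly pronounced words accepted; 42/50 mispronounced
# words flagged; one shared threshold across all words.
correct_accepted = 46
mispronounced_flagged = 42
n_per_class = 50

true_accept_rate = correct_accepted / n_per_class        # 0.92
true_reject_rate = mispronounced_flagged / n_per_class   # 0.84
overall_accuracy = (correct_accepted + mispronounced_flagged) / (2 * n_per_class)
print(true_accept_rate, true_reject_rate, overall_accuracy)  # 0.92 0.84 0.88
```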

In parallel, I completed the phonological features and generated acoustic models for the TIMIT database, because I had difficulty finding the complete set of wav files for the WSJ database. However, I failed in both decoding and forced alignment with the new models generated from the phonological features. I also failed to generate appropriate models with Sphinx MFC features; even when the models were generated properly, I did not get results from the forced-alignment or decode functions after swapping out the WSJ models. I will try to overcome these issues by next week.

Ronanki: GSoC 2012 Pronunciation Evaluation Week 11

This week, I only managed to do the data collection required to evaluate the project.

The database collection is over, but the data is spread across different servers; I am trying to bring it into one place. You can find part of the data for one speaker here: http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/Sru/

The description of the data is at http://researchweb.iiit.ac.in/~srikanth.ronanki/GSoC/PE_database/description.txt