Rewrite of vocabcompiler by Holzhaus · Pull Request #181 · jasperproject/jasper-client

Holzhaus · 2014-09-17T17:07:34Z

EDIT: Just scroll down to this comment to see what this PR does. No need to read the whole conversation.

Separated generic code from pocketsphinx logic
Using OOP so that code can be reused for other STT engines, too
The pocketsphinx part of vocabcompiler now uses the cmuclmtk wrapper library to compile the languagemodel/dictionary.
A revision check has been implemented, so that vocabulary won't get recompiled if there's no need.
Proper integration into jasper.py, client/stt.py and client/test.py is still missing due to pending pull requests that change these modules.

Please have a look and execute this module directly:

$ JASPER_HOME=~/Projekte python2 client/vocabcompiler.py --base-dir=/tmp 
Module phrases:    ['BIRTHDAY', 'EMAIL', 'FACEBOOK', 'FIRST', 'HACKER', 'INBOX', 'JOKE', 'KNOCK KNOCK', 'LIFE', 'MEANING', 'MUSIC', 'NEWS', 'NO', 'NOTIFICATION', 'OF', 'SECOND', 'SPOTIFY', 'THIRD', 'TIME', 'TODAY', 'TOMORROW', 'WEATHER', 'YES']
Vocabulary in:     /tmp/pocketsphinx-vocabulary/default
Revision file:     /tmp/pocketsphinx-vocabulary/default/revision
Compiled revision: None
Is compiled:       False
Matches phrases:   False
Compiling...
<snip>
idngram2lm : Done.

Vocabulary in:     /tmp/pocketsphinx-vocabulary/default
Revision file:     /tmp/pocketsphinx-vocabulary/default/revision
Compiled revision: 8c0e66f387b7177205726ecfcb3edd9df0d23653
Is compiled:       True
Matches phrases:   True

$ JASPER_HOME=~/Projekte python2 client/vocabcompiler.py --base-dir=/tmp
Module phrases:    ['BIRTHDAY', 'EMAIL', 'FACEBOOK', 'FIRST', 'HACKER', 'INBOX', 'JOKE', 'KNOCK KNOCK', 'LIFE', 'MEANING', 'MUSIC', 'NEWS', 'NO', 'NOTIFICATION', 'OF', 'SECOND', 'SPOTIFY', 'THIRD', 'TIME', 'TODAY', 'TOMORROW', 'WEATHER', 'YES']
Vocabulary in:     /tmp/pocketsphinx-vocabulary/default
Revision file:     /tmp/pocketsphinx-vocabulary/default/revision
Compiled revision: 8c0e66f387b7177205726ecfcb3edd9df0d23653
Is compiled:       True
Matches phrases:   True

As you can see, the instance detects that the vocabulary has already been compiled and is still up to date, so recompilation is unnecessary and is skipped.
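The revision check visible in the output above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the function names are hypothetical, and hashing the sorted phrases with SHA-1 is an assumption (the 40-character revision in the output merely looks like a SHA-1 digest).

```python
import hashlib
import os
import tempfile


def phrases_revision(phrases):
    """Return a hex digest that changes whenever the set of phrases changes."""
    sha1 = hashlib.sha1()
    # Sort and de-duplicate so that phrase order does not affect the revision.
    for phrase in sorted(set(phrases)):
        sha1.update(phrase.encode("utf-8"))
        sha1.update(b"\n")
    return sha1.hexdigest()


def compiled_revision(revision_file):
    """Return the stored revision, or None if nothing has been compiled yet."""
    if not os.path.exists(revision_file):
        return None
    with open(revision_file) as f:
        return f.read().strip()


def needs_compile(revision_file, phrases):
    """True if the stored revision does not match the current phrases."""
    return compiled_revision(revision_file) != phrases_revision(phrases)


def mark_compiled(revision_file, phrases):
    """Record the revision after a successful compilation."""
    with open(revision_file, "w") as f:
        f.write(phrases_revision(phrases))


# Usage: the first check reports that compilation is needed, the second does not.
revision_file = os.path.join(tempfile.mkdtemp(), "revision")
first_check = needs_compile(revision_file, ["YES", "NO"])
mark_compiled(revision_file, ["YES", "NO"])
second_check = needs_compile(revision_file, ["NO", "YES"])
```

Storing only a digest keeps the check cheap: the expensive compilation step is skipped whenever the digest of the current phrases equals the stored one.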

Holzhaus · 2014-09-18T20:47:09Z

For the unittest section, I'll implement a dummy subclass.

I guess __init__() and compile() should be called inside the __init__() (or get_config()) method of the corresponding STT engine.
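To illustrate the idea of a generic base class with a dummy subclass for unit tests, here is a minimal sketch. It is written for Python 3 and does not reproduce the classes in this PR: the attribute names, the `_compile_vocabulary` hook and `DummyVocabulary` are assumptions made for the example, and the in-memory comparison stands in for the revision file.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class AbstractVocabulary(ABC):
    """Generic part: decides where files live and whether to recompile."""

    def __init__(self, name, base_dir):
        self.dirpath = os.path.join(base_dir, name)
        self.compiled_phrases = None  # stand-in for a revision file

    def compile(self, phrases, force=False):
        """Compile if needed; return True if compilation actually ran."""
        phrases = sorted(set(phrases))
        if not force and phrases == self.compiled_phrases:
            return False  # already up to date, nothing to do
        os.makedirs(self.dirpath, exist_ok=True)
        self._compile_vocabulary(phrases)
        self.compiled_phrases = phrases
        return True

    @abstractmethod
    def _compile_vocabulary(self, phrases):
        """Engine-specific compilation, implemented by subclasses."""


class DummyVocabulary(AbstractVocabulary):
    """Test double: writes the phrases to a file instead of compiling."""

    def _compile_vocabulary(self, phrases):
        with open(os.path.join(self.dirpath, "dummy"), "w") as f:
            f.write("\n".join(phrases))


vocabulary = DummyVocabulary("test", tempfile.mkdtemp())
ran_first = vocabulary.compile(["YES", "NO"])
ran_second = vocabulary.compile(["NO", "YES"])  # same phrases, skipped
```

With this split, tests can exercise the generic recompilation logic through the dummy subclass without having any STT toolchain installed.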

Holzhaus · 2014-09-25T15:48:01Z

This PR prevents unnecessary vocabulary recompilation and therefore saves CPU time and SD card write cycles, so I guess it can be considered fairly urgent.

I'd like to start to integrate this as soon as #176 is merged.

Holzhaus · 2014-09-25T19:39:12Z

I rebased to include the changes from PR #197.

alexsiri7 · 2014-09-26T05:23:53Z

assertFalse

Holzhaus · 2014-09-26T15:39:55Z

Bonus: the Python cmuclmtk library won't clutter the screen with output messages we don't want to see. You can see the difference when you run python2 client/vocabcompiler.py and python2 client/vocabcompiler.py --debug.
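The general pattern of hiding a tool's output unless debugging is enabled can be sketched like this. It is not how the cmuclmtk wrapper is implemented (that is not shown in this thread); `run_quietly` is a hypothetical helper, and a small Python one-liner stands in for the noisy external tool.

```python
import logging
import subprocess
import sys


def run_quietly(cmd, logger):
    """Run cmd and return its exit code; show output only at DEBUG level."""
    if logger.isEnabledFor(logging.DEBUG):
        # Debugging: let the tool write to our stdout/stderr as usual.
        return subprocess.call(cmd)
    # Normal operation: discard everything the tool prints.
    return subprocess.call(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )


logger = logging.getLogger("vocabcompiler_sketch")
logger.setLevel(logging.INFO)

# Stand-in for a chatty command line tool.
noisy_cmd = [sys.executable, "-c", "print('noisy tool output')"]
exit_code = run_quietly(noisy_cmd, logger)
```

The exit code is still returned in both cases, so errors can be detected even when the output is discarded.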

Holzhaus · 2014-09-26T15:55:21Z

Rebased to upstream/master.

Holzhaus · 2014-09-27T14:48:48Z

The vocabcompiler should now be working, but still needs testing.

charliermarsh · 2014-10-01T00:26:29Z

This can just be if cmd_exists(cmd), I think

charliermarsh · 2014-10-01T00:29:44Z

Haven't read through it all, but my tests are failing. I think the TestG2P tests need to be refactored to use OOP.

Holzhaus · 2014-10-01T14:23:27Z

In case we merge PR #209, I'd like to change this (or create a follow-up PR to this one):

If you look at jasper.py and client/mic.py, we're using the Mic class like this:

class Mic:
    def __init__(self, speaker, passive_stt_engine, active_stt_engine):

So we pass in two different STT Engine instances:

An STT Engine instance for passive listen
An STT Engine Instance for active listen

But PocketsphinxSTT takes up to 3 lm/dict pairs:

1. A pair for passive listen
2. A pair for active listen
3. A pair for active listen (musicmode)

This is a lot of duplication: in jasper.py, we create two separate STT engine instances for active and passive listening, so the active STT engine instance only uses the second lm/dict pair and the passive STT engine instance only uses the first pair.
In MusicMode, a third STT instance will be created that only uses the third pair.

Given the case that someone wants to write a new module that also has the ability to start a mode like the MusicMode, he'd either have to hijack the music lm/dict pair, or he'd need to add a new mode to STT Engines and change the PocketsphinxSTTEngine code.

Thus, I'd like to simplify the STT engine dramatically by removing two of the three dict pairs (or rather Vocabulary Instances) from the PocketsphinxSTT engine and the mode parameter in transcribe accordingly, so that you simply do this:

# This is just for demonstration purposes (not the actual code)
# normal operation
passive_vocabulary = PocketsphinxVocabulary(name='keyword')
passive_vocabulary.compile(vocabcompiler.get_keyword_phrases())
passive_stt_engine = PocketsphinxSTT(passive_vocabulary)

active_vocabulary = PocketsphinxVocabulary(name='default')
active_vocabulary.compile(vocabcompiler.get_all_phrases())
active_stt_engine = PocketsphinxSTT(active_vocabulary)

mic = Mic(speaker, passive_stt_engine, active_stt_engine)

# now let's create a new mode in a module
plugin_name = "mpdcontrol"
plugin_phrases = ['PLAY', 'PAUSE', 'STOP', ...]

music_vocabulary = PocketsphinxVocabulary(name=plugin_name)
music_vocabulary.compile(plugin_phrases)
music_stt_engine = PocketsphinxSTT(music_vocabulary)
mic = Mic(speaker, passive_stt_engine, music_stt_engine)

@crm416 What do you think?

Holzhaus · 2014-10-03T13:06:39Z

OK, I rebased to fix merge conflicts with PR #209. G2P testcases are now fixed, so that they should be a.... pronounced success ;-) *_ba dum tss_* Additionally, I made some style fixes.

Please test. My suggestion (above) will be part of a future pull request.

Holzhaus · 2014-10-06T18:06:40Z

Anyone willing to test this?

This should raise vocabcompiler test coverage to 100%. Whohooo!

Holzhaus · 2014-10-08T18:53:59Z

Rebased to latest upstream/master.

coveralls · 2014-10-08T18:56:50Z

Coverage increased (+9.09%) when pulling bd9e667 on Holzhaus:vocabcompiler-abstraction into ea91735 on jasperproject:master.

Holzhaus · 2014-10-08T18:57:41Z

Just an update on what this PR does:

makes recompilation of a vocabulary unnecessary if the phrases didn't change
therefore saves CPU time and SD card write cycles
combines lm/dict arguments for PocketsphinxSTT
creates an abstract vocabulary class, so that new vocabulary types (e.g. HTK/Julius or the like) can be implemented easily with full revision/recompilation and error handling support
hides unwanted CMUCLMTK output from the user (unless the loglevel is set to DEBUG)
uses tempfiles for compilation and thus fixes #119, "Why must everything be world writeable?" (world-writable app dir)
improves phonetisaurus-g2p wrapper code
adds support for Phonetisaurus-G2P's nbest argument via profile.yml
raises overall unittest coverage by over 9%
raises vocabcompiler.py unittest coverage to 100%
have I forgotten something?
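The tempfile point in the list above can be illustrated with a short sketch: intermediate files are written to a private temporary directory, and only the results are copied to the vocabulary directory. The helper names and the `languagemodel` filename are assumptions for this example, and `fake_build` merely stands in for the real compilation step.

```python
import os
import shutil
import tempfile


def compile_in_tempdir(phrases, target_dir, build):
    """Run build(workdir, phrases) in a temp dir, then copy the results."""
    workdir = tempfile.mkdtemp(prefix="vocab-")
    try:
        build(workdir, phrases)
        os.makedirs(target_dir, exist_ok=True)
        for filename in os.listdir(workdir):
            shutil.copy(
                os.path.join(workdir, filename),
                os.path.join(target_dir, filename),
            )
    finally:
        # Always clean up the intermediate files, even if build() fails.
        shutil.rmtree(workdir)


def fake_build(workdir, phrases):
    """Stand-in for the real language model / dictionary compilation."""
    with open(os.path.join(workdir, "languagemodel"), "w") as f:
        f.write("\n".join(phrases))


target_dir = os.path.join(tempfile.mkdtemp(), "default")
compile_in_tempdir(["YES", "NO"], target_dir, fake_build)
```

Because nothing is written next to the application code, the application directory no longer needs to be writable by the user running it.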

IMO, we're ready for merging, but I'd like someone else to test this. Please.

Holzhaus · 2014-10-09T01:13:29Z

@crm416 @shbhrsaha Can you please test this?

shbhrsaha · 2014-10-09T18:40:56Z

Sure I'll test it this weekend and report back!

Shubhro

Holzhaus · 2014-10-10T11:28:29Z

Thanks!

shbhrsaha · 2014-10-13T04:25:00Z

LGTM. Slick feature, well done!

Rewrite of vocabcompiler

Holzhaus added the enhancement label Sep 17, 2014

Holzhaus mentioned this pull request Sep 19, 2014

STT engine: Julius support #185

Closed

Holzhaus force-pushed the vocabcompiler-abstraction branch from 5d69f3f to 485fde7 Compare September 24, 2014 17:39

Holzhaus force-pushed the vocabcompiler-abstraction branch from aa78551 to 542b870 Compare September 25, 2014 19:37

alexsiri7 reviewed Sep 26, 2014

Holzhaus force-pushed the vocabcompiler-abstraction branch from 81860c2 to ea2f0d2 Compare September 26, 2014 15:54

Holzhaus force-pushed the vocabcompiler-abstraction branch from ea2f0d2 to 1300a63 Compare September 27, 2014 14:06

Holzhaus mentioned this pull request Sep 28, 2014

Use config dir for non-temporary writable files #203

Merged

Holzhaus added bug needstesting and removed enhancement labels Sep 28, 2014

charliermarsh reviewed Oct 1, 2014

Comment thread client/g2p.py Outdated

Holzhaus force-pushed the vocabcompiler-abstraction branch 2 times, most recently from 2cb471a to 0fdb9a9 Compare October 3, 2014 12:43

Holzhaus force-pushed the vocabcompiler-abstraction branch 2 times, most recently from aecdb48 to a81f889 Compare October 6, 2014 15:59

Holzhaus force-pushed the vocabcompiler-abstraction branch 2 times, most recently from fa9d821 to 9616b1b Compare October 8, 2014 11:04

Holzhaus added 17 commits October 8, 2014 20:53

Minor style fix in phonetisaurus-g2p code

783b673

Fix G2P testcases, do not depend on fixed translation string

93322f0

Remove unneccessary print from g2p.py

afbf131

Remove unused tempfile from client/g2p.py

5f72166

PEP8 style fixes in test.py and vocabcompiler.py

e3003e4

Add unittests for G2P without phonetisaurus

58326ad

Add testcases for (patched) PocketsphinxVocabulary

d5c3b65

improve vocabulary unittest coverage

6ea8bec

Add test for keyword phrase extraction

6d25d90

Update TestMic testcase to work with Vocabulary

ed050c5

Further improve unittest coverage

a52b3db

Remove unneccessary pass statement from AbstractVocabulary class

666a633

This should raise vocabcompiler test coverage to 100%. Whohooo!

Use diagnose.check_executable for executable detection in g2p.py

48367d2

Reorder imports

210df8e

Use configfile from jasper config dir in g2p.py (i.e. use )

1bd9e21

remove redundant variable assignment

ea88f40

Fix G2P testcase

bd9e667

Holzhaus force-pushed the vocabcompiler-abstraction branch from 7140ee1 to bd9e667 Compare October 8, 2014 18:53

Holzhaus mentioned this pull request Oct 9, 2014

Remove transcription mode #219

Merged

Holzhaus added a commit that referenced this pull request Oct 13, 2014

Merge pull request #181 from Holzhaus/vocabcompiler-abstraction

f1ffdd7

Rewrite of vocabcompiler

Holzhaus merged commit f1ffdd7 into jasperproject:master Oct 13, 2014

Holzhaus deleted the vocabcompiler-abstraction branch October 13, 2014 12:23

Holzhaus removed the needstesting label Jan 7, 2015

5 participants