Edobashira
Sunday, 12 August 2012
Adding a User Defined Fst to OpenFst
All the code for the mmap-fst is avaliable in this gist. The implementation of the mmap-fst is contained in the file mmap-fst.h, the wrapper class around the mmap functions is taken from the mozc project. The wrapper supports both *nix and Windows oses, although only the former support the shared objects described in the article. I changed the length type in the Mmap class to allow larger than 2GB files. The MmapFst itself is basically a cut n' paste job on the ConstFst, the main difference is the MmapFst uses the mmap wrapper instead of array to access the state and arc information.
Given the Fst code we want to make the OpenFst library and existing OpenFst command line tools aware of of the MmapFst type. To achieve this we need the c++ code and the Makefile to construct a shared object. The registration code that is shown below and is very simple
The makefile below will compile the code to a shared object.
The final step is to make sure the shared object is on the LD_LIBRARY_PATH. We can then convert fsts to the mmap type and use it in all the command line tools. For example we can convert an Fst to MmapFst type by:
fstconvert --fst_typo=mmap < some.fst > mmap.fst
and view fst information by:
fstinfo mmap.fst
fst type mmap
arc type standard
input symbol table none
.....
....
Wednesday, 8 August 2012
Shortest Paths in Weighted Graphs using a Stack
$\hspace{3em}\text{Decrease-key}(Q, e.v, d[e.v])$
$\hspace{2em}\text{else}$
Consider the below graph, the vertex (state) marked 0 is the initial or start state and the double circle is the final or accept vertex (state), although the algorithm is computing the shortest path from vertex zero to every other vertex we are only interested in the shortest path to state three. On inspection it easy to see the shortest path is to follow all edges with cost 0 that is the path $0\rightarrow1\rightarrow2\rightarrow3$
the second column the element popped at that time step, the queue column is the items in the queue at the end of the current iteration and the substituent columns are the vertex costs. Here we see that for our graph that eight pops are required for the this graph as each path that has more edges but a lower
overall costs requires us to re-visit all the subsequent states again.
| Time | Pop | Queue | $d[0]$ | $d[1]$ | $d[2]$ | $d[3]$ |
| 0 | - | 0 | $0$ | $\infty$ | $\infty$ | $\infty$ |
| 1 | 0 | 1,2,3 | $0$ | $0$ | $2$ | $3$ |
| 2 | 3 | 1,2 | $0$ | $0$ | $2$ | $3$ |
| 3 | 2 | 1,3 | $0$ | $0$ | $2$ | $2$ |
| 4 | 3 | 1 | $0$ | $0$ | $2$ | $2$ |
| 5 | 1 | 2,3 | $0$ | $0$ | $0$ | $1$ |
| 6 | 3 | 2 | $0$ | $0$ | $0$ | $1$ |
| 7 | 2 | 3 | $0$ | $0$ | $0$ | $0$ |
| 8 | 3 | $\emptyset$ | $0$ | $0$ | $0$ | $0$ |
Repeating the analysis for a FIFO queue we see that it only necessary to visit each state once.
| Time | Pop | Queue | $d[0]$ | $d[1]$ | $d[2]$ | $d[3]$ |
| 0 | - | 0 | $0$ | $\infty$ | $\infty$ | $\infty$ |
| 1 | 0 | 1,2,3 | $0$ | $0$ | $2$ | $3$ |
| 1 | 1 | 2,3 | $0$ | $0$ | $0$ | $1$ |
| 1 | 2 | 3 | $0$ | $0$ | $0$ | $0$ |
| 1 | 3 | $\emptyset$ | $0$ | $0$ | $0$ | $0$ |
In summary when computing shortest paths the LIFO queue is fine if the graph has no weights each state only needs to be visited once. For weighted graphs either a FIFO or priority queue is a better choice, the later having better running time. Although in practise I have observed the FIFO approach consistently performing better. For graphs with negative weighted edges a priority queue can also exhibit exponential worst case running time.
[1] Jeff Erickson, "Algorithms Lecture 19: Shortest Paths", 2011, http://www.cs.uiuc.edu/~jeffe/teaching/algorithms/notes/19-sssp.pdf
Friday, 4 May 2012
C++11x Range Based for loops
Here is a first pass at using C++11x range based for loops to over states and arcs in an OpenFst Fst. It also uses type inference (new use of the auto keyword.)
Requires the compiler has the relevant C++11x support, this code was tested using Visual Studio 11 Beta
#include <iostream>
#include <fst/vector-fst.h>
using namespace fst;
using namespace std;
template<class F>
class StateRange {
public:
explicit StateRange(const F& f) : f_(f) { }
template<class F>
class Iterator {
typedef typename F::StateId StateId;
public:
explicit Iterator (const F& f) : si_(f) { }
bool operator != (const Iterator & other) const {
return !si_.Done();
}
StateId operator* () const {
return si_.Value();
}
const Iterator& operator++ () {
si_.Next();
return *this;
}
StateIterator<F> si_;
};
public:
Iterator<F> begin() const { return Iterator<F>(f_); }
Iterator<F> end() const { return Iterator<F>(f_); }
const F& f_;
};
template<class F>
StateRange<F> States(const F& f) {
return StateRange<F>(f);
}
template<class F>
class ArcRange {
public:
typedef typename F::StateId StateId;
ArcRange(const F& f, StateId s) : f_(f), s_(s) { }
template<class F>
class Iterator {
typedef typename F::Arc Arc;
typedef typename F::StateId StateId;
public:
Iterator (const F& f, StateId s) : ai_(f, s) { }
bool operator != (const Iterator & other) const {
return !ai_.Done();
}
const Arc& operator* () const {
return ai_.Value();
}
const Iterator& operator++ () {
ai_.Next();
return *this;
}
ArcIterator<F> ai_;
};
public:
Iterator<F> begin() const { return Iterator<F>(f_, s_); }
Iterator<F> end() const { return Iterator<F>(f_, s_); }
const F& f_;
const StateId s_;
};
template<class F>
ArcRange<F> Arcs(const F& f, typename F::StateId s) {
return ArcRange<F>(f, s);
}
int main() {
StdVectorFst fst;
fst.AddState();
fst.AddState();
fst.SetStart(0);
fst.SetFinal(1, TropicalWeight::One());
fst.AddArc(0, StdArc(0, 0, 0, 1));
fst.AddArc(0, StdArc(1, 0, 0, 1));
fst.AddArc(0, StdArc(2, 0, 0, 1));
for (auto s : States(fst))
for (auto a : Arcs(fst, s))
printf("%d %d %d %d %f\n", s, a.nextstate, a.ilabel, a.olabel, a.weight.Value());
}
Wednesday, 27 July 2011
Install MITLM Under Cygwin on Windows
MITLM is a very nice language modelling toolkit. It is fast, easy to use and consistently produces language models that out perform those generated by other language model toolkits. Building under Cygwin with a few additional commands is possible:
svn checkout http://mitlm.googlecode.com/svn/trunk/ mitlm-read-only
cd mitlm-read-only
ln -s /lib/gcc/i686-pc-cygwin/3.4.4/libg2c.a
/lib/gcc/i686-pc-cygwin/4.3.4/libg2c.a
sed 's/LDFLAGS = -L. -lg2c -lmitlm/LDFLAGS = -L. -lmitlm -lg2c/' <
Makefile.example > Makefile
make -j
Tuesday, 26 July 2011
Latex In Blogger
After seeing the Latex support in wordpress.com blogs, found this nice explanation for adding Latex support to blogger.
${a^nb^n| n \geq 0}$
The new blogger interface makes it a pain to edit the HTML template now.
Wednesday, 16 March 2011
Setting up OpenFst 1.2.7 with Visual Studio
1 Obtain (and Build) the OpenFst binaries
Download the pre-built 1.2.7 binaries and source from:
http://code.google.com/p/openfstwin/downloads/list
If you want to rebuild from source then, open the solution openfstwin.sln in Visual Studio 2010 and build the Debug and/or Release configurations. This will build all the command lines and OpenFst dll for the selected configurations.
2 Set PATH variables
Add the openfstwin Release and/or Debug sub-directories to the system or user PATH variable. (Control Panel\System and Security\System, Advanced System Settings->Advanced, Environment). This allows the system to find the OpenFst dlls when an executable is run.
3 Create new project
Use the C++ project wizard to create an empty Win32 Console Application and give it name for example fsttest.
Create a Console application and select Empty project
4 Create a Property Page for OpenFst or set paths
Unfortunately Microsoft removed the Global include/lib/bin options from Visual Studio 2010. This means we have to setup the OpenFfst include and lib directories for every project. There are two workarounds to make thing easier for subsequent project . Set the INCLUDE and LIBPATH variables (vis Control Panel\System and Security\System, Advanced System Settings->Advanced, Environment).
An alternative re-usable workaround is to a Property Sheet for the OpenFst directories and re-use it for all the new projects. This technique has the advantage that intellisense will work for the OpenFst library.
Go to the Property Manager [View->Property Manager], and then right click on any project node and select properties from the context menu.
Add a new property sheet but store it somewhere for use in other project. In this example the property sheet is called openfst.
5 Setup the C++ Directory Paths
Next right click one of openfst.prop nodes and select properties. Go to C++ General and Open [Additional Include Directories]
Add the full path to the OpenFst include to the “Include Directories” and both the Debug and Release folders to the “Library Directory”
This version of OpenFstWin uses the auto-linking support in Visual Studio and therefore it is not necessary to manual set a reference to the library or dll. The debug dll is names openfst-gd.lib/dll andthe release builds are openfst.lib/dll.
Note it is essential to link the correct version (the auto-linking should handle this for you anyway) otherwise the incompatibilities between the iterator debugging for the two builds will cause incomprehensible crashes. In addition do not disable iterator debugging in your client applications debug builds (or enable it in release) without making the necessary changes to the core OpenFst visual builds.
After setting the environment variables everything should happen automatically!
Debugging
The debug build will generate debug information so it will become possible to step into the library cc files. All of the algorithm are implemented as header files. The symbol table and parts of the FST IO routines are in cc files, if you do not need to step into this code you can skip this step.
To add the debug Tools->Options->Debugging->Symbols and add the path to the OpenFst debug.
Will all these setting you can add a C++ code files to your project and build and debug against OpenFst.
Saturday, 14 August 2010
Lookahead Composition in Openfst 1.2
Openfst 1.2 was just released. One of the big addition is the lookahead composition filters, more details can be found in [2] and [3]. Get the latest source from here . By default the lookahead filters are not built so run configure with all the options.
configure --enable-compact-fsts --enable-const-fsts --enable-far --enable-lookahead-fsts --enable-pdt
Here is quick example of using the lookahead filters to compose two speech recognition transducers, a lexicon L and and language model G. Using these filter it is no possible to determinize the lexicon before composition with G. The advantage is the costly determinization of LG is avoided.
These are the steps to perform an offline lookahead compostion. The below assumes a closed lexicon L with auxillary symbols and and n-gram language model G constructed according to [1]. In this case L has 563232 of states and 656819 arcs and G has 997740 states and 4764980 arcs.
First determinize and convert the lexicon to an olabel_lookahead type fst. The save_olabel_pairs options although not shown in the fstconvert help output is valid option for the command.
fstdeterminize l.fst | fstconvert –fst_type=olabel_lookahead –save_relabel_opairs=relabel.txt > lookahead.l.fst
Next relabel the input of the language model to match the lookahead lexicon. The arcsort command will sort the input of language model. The sorting is essential for efficient composition in the following stage.
fstrelabel -relabel_ipairs=relabeltxt g.fst| fstarcsort > relabel.g.fst
Finally compose the L and G machines togeather. The compose tool can automatically detect that the L transducer is a lookahead type and select the optimized filter.
fstcompose lookahead.l.fst sorted.g.fst > lg.fst
Here is the output of fstinfo for lg.fst
# of states 5863514
# of arcs 9630754
# of final states 1
# of input/output epsilons 1414832
# of input epsilons 1904971
# of output epsilons 5863511
# of accessible states 5863514
# of coaccessible states 5863514
# of connected states 5863514
# of strongly conn components 150128
For a comparison a more standard det(LG) cascade.
# of states 6294455
# of arcs 10053167
# of final states 1
# of input/output epsilons 1995247
# of input epsilons 1995247
# of output epsilons 6302523
# of accessible states 6294455
# of coaccessible states 6294455
# of connected states 6294455
# of strongly conn components 168471
The difference in size is party due to the epsilon closure in the non-determinized L transducers.
[1] Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley.
Speech recognition with weighted finite-state transducers.
In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech Communication, Part E: Speech recognition. Springer-Verlag, Heidelberg, Germany, 2008.
[2] Cyril Allauzen, Michael Riley, Johan Schalkwyk, A Generalized Composition Algorithm for Weighted Finite-State Transducers, Interspeech 2009.
[3] Cyril Allauzen, Michael Riley and Johan Schalkwyk, Filters for Efficient Composition of Weighted Finite-State Transducer", Proceedings of the Fifteenth International Conference on Implementation and Application of Automata, (CIAA 2010), Winnipeg, MB.

