## Sunday, 12 August 2012

### Adding a User Defined Fst to OpenFst

In the post I will show how to add user defined Fst type to OpenFst. openfst.org doesn't go into detail on the process to add a new Fst type. The Fst type is memory mappable immutable transducers and this should allow for the same Fst instance to be shared across processes that load the same Fst.
All the code for the mmap-fst is avaliable in this gist. The implementation of the mmap-fst is contained in the file mmap-fst.h, the wrapper class around the mmap functions is taken from the mozc project. The wrapper supports both *nix and Windows oses, although only the former support the shared objects described in the article. I changed the length type in the Mmap class to allow larger than 2GB files. The MmapFst itself is basically a cut n' paste job on the ConstFst, the main difference is the MmapFst uses the mmap wrapper instead of  array to access the state and arc information.

Given the Fst code we want to make the OpenFst library and existing OpenFst command line tools aware of of the MmapFst type. To achieve this we need the c++ code and the Makefile to construct a shared object. The registration code that is shown below and is very simple

The makefile below will compile the code to a shared object.

The final step is to make sure the shared object is on the LD_LIBRARY_PATH. We can then convert fsts to the mmap type and use it in all the command line tools. For example we can convert an Fst to MmapFst  type by:

fstconvert --fst_typo=mmap < some.fst > mmap.fst

and view fst information by:

fstinfo mmap.fst

fst type                                          mmap
arc type                                         standard
input symbol table                        none
.....
....

## Wednesday, 8 August 2012

### Shortest Paths in Weighted Graphs using a Stack

When computing shortest paths in graph (for example as partof a beam search) it is tempting to apply a pruned recursive algorithm. This is equivalent to performing a generic shortest path algorithm with
last-in first out LIFO queue (a stack). In an un-weighted graph this approachis fine, however in weighted graphs even with positive only weight this algorithm can have exponential worst case complexity.

The generic single source shorestpath algorihtm is shown below[1]. It assumes the input is a graph $G(V,E)$ where $V$ and $E$ denotes the vertices and edges respectively, $e$ is an edge with member $v$ for the destiantion vertex and $w$ for the edge costs,  $d$ is an array of vertex (or state) costs.In the first three lines of the algorithm the array of vertex costs is initialised. The initial state of the graph is added to the queue and the then the main loop of the algorithm is run. We are interested in a stack/lifo queue, therefore the dequeue operation of line seven will remove the last vertex that was added to the queue. The algorithm then process each of the arcs leaving the state and if the destination state is lower we update the cost eleement and either add the destination to the queue or decrease the key if the state is already in the queue. For the stack in this analysis the decrease-key operation does nothing as there are no priorities in this case and we update the statce costs in line ten. (The decrease-key is needed for priority queues).

$d[0] \leftarrow 0$
$\text{for } i = 1 \dots |V| - 1$
$\hspace{1em} d[i] \leftarrow \infty$
$Q \leftarrow s$
$\text{while } Q \neq \emptyset$
$\hspace{1em} u \leftarrow \text{Dequeue}(Q)$
$\hspace{1em} \text{for } e \in \text{adj}(Q, u)$
$\hspace{1em} \text{if } d[e.v] > d[u] + e.w$
$\hspace{2em} d[e.v] \leftarrow d[u] + e.w$
$\hspace{2em}\text{if } e.v \in Q$
$\hspace{3em}\text{Decrease-key}(Q, e.v, d[e.v])$
$\hspace{2em}\text{else}$
$\hspace{3em}\text{Enqueue}(Q, v.e)$

If we use a first-in first-out (FIFO) queue the algorithm becomes equivalent to the
Bellman-ford aglorithm. In the case of priority queue discipline the algorithm corresponds to
Dijkstra algorithm. In the remainder of this article I will explore the SSSP algorithm with a
LIFO queue and will show certain weighted graphs give rise undesirable SSSP running times.

Consider the below graph, the vertex (state) marked 0 is the initial or start state and the double circle is the final or accept vertex (state), although the algorithm is computing the shortest  path from vertex zero to every other vertex we are only interested in the shortest path to state three. On inspection it easy to see the shortest path is to follow all edges with cost 0 that is the path $0\rightarrow1\rightarrow2\rightarrow3$

The below table shows the steps in the algorithm. The first column is the iteration algorithm,
the second column the element popped at that time step, the queue column is the items in the queue at the end of the current iteration and the substituent columns are the vertex costs. Here we see that for our graph that eight pops are required for the this graph as each path that has more edges but a lower
overall costs requires us to re-visit all the subsequent states again.

 Time Pop Queue $d[0]$ $d[1]$ $d[2]$ $d[3]$ 0 - 0 $0$ $\infty$ $\infty$ $\infty$ 1 0 1,2,3 $0$ $0$ $2$ $3$ 2 3 1,2 $0$ $0$ $2$ $3$ 3 2 1,3 $0$ $0$ $2$ $2$ 4 3 1 $0$ $0$ $2$ $2$ 5 1 2,3 $0$ $0$ $0$ $1$ 6 3 2 $0$ $0$ $0$ $1$ 7 2 3 $0$ $0$ $0$ $0$ 8 3 $\emptyset$ $0$ $0$ $0$ $0$

Repeating the analysis for a FIFO queue we see that it only necessary to visit each state once.

 Time Pop Queue $d[0]$ $d[1]$ $d[2]$ $d[3]$ 0 - 0 $0$ $\infty$ $\infty$ $\infty$ 1 0 1,2,3 $0$ $0$ $2$ $3$ 1 1 2,3 $0$ $0$ $0$ $1$ 1 2 3 $0$ $0$ $0$ $0$ 1 3 $\emptyset$ $0$ $0$ $0$ $0$
Likewise the algorithm runs in the same time for a priority queue, however, a graph with negative weights can be constructed to yield exponential running time. For example re-weighting the graph as shown below will give the same characteristics as the LIFO queue with positive weights considered earlier

In summary when computing  shortest paths the  LIFO queue is fine if the graph has no weights each state only needs to be visited once. For weighted graphs either a FIFO or priority queue is a better choice, the later having better running time. Although in practise I have observed the FIFO approach consistently performing better. For graphs with negative weighted edges a priority queue can also exhibit exponential worst case running time.

[1] Jeff Erickson, "Algorithms Lecture 19: Shortest Paths", 2011,  http://www.cs.uiuc.edu/~jeffe/teaching/algorithms/notes/19-sssp.pdf

## Friday, 4 May 2012

### C++11x Range Based for loops

Here is a first pass at using C++11x range based for loops to over states and arcs in an OpenFst Fst. It also uses type inference (new use of the auto keyword.)

Requires the compiler has the relevant C++11x support, this code was tested using Visual Studio 11 Beta

#include <iostream>

#include <fst/vector-fst.h>

using namespace fst;

using namespace std;

template<class F>

class StateRange {

public:

explicit StateRange(const F& f) : f_(f) { }

template<class F>

class Iterator {

typedef typename F::StateId StateId;

public:

explicit Iterator (const F& f) : si_(f) {  }

bool operator != (const Iterator & other) const {

return !si_.Done();

}

StateId operator* () const {

return si_.Value();

}

const Iterator& operator++ () {

si_.Next();

return *this;

}

StateIterator<F> si_;

};

public:

Iterator<F> begin() const { return Iterator<F>(f_); }

Iterator<F> end() const { return Iterator<F>(f_); }

const F& f_;

};

template<class F>

StateRange<F> States(const F& f) {

return StateRange<F>(f);

}

template<class F>

class ArcRange {

public:

typedef typename F::StateId StateId;

ArcRange(const F& f, StateId s) : f_(f), s_(s) {   }

template<class F>

class Iterator {

typedef typename F::Arc Arc;

typedef typename F::StateId StateId;

public:

Iterator (const F& f, StateId s) : ai_(f, s) {  }

bool operator != (const Iterator & other) const {

return !ai_.Done();

}

const Arc& operator* () const {

return ai_.Value();

}

const Iterator& operator++ () {

ai_.Next();

return *this;

}

ArcIterator<F> ai_;

};

public:

Iterator<F> begin() const { return Iterator<F>(f_, s_); }

Iterator<F> end() const { return Iterator<F>(f_, s_); }

const F& f_;

const StateId s_;

};

template<class F>

ArcRange<F> Arcs(const F& f, typename F::StateId s) {

return ArcRange<F>(f, s);

}

int main() {

StdVectorFst fst;

fst.SetStart(0);

fst.SetFinal(1, TropicalWeight::One());

fst.AddArc(0, StdArc(0, 0, 0, 1));

fst.AddArc(0, StdArc(1, 0, 0, 1));

fst.AddArc(0, StdArc(2, 0, 0, 1));

for (auto s : States(fst))

for (auto a : Arcs(fst, s))

printf("%d %d %d %d %f\n", s, a.nextstate, a.ilabel, a.olabel, a.weight.Value());

}

## Wednesday, 27 July 2011

### Install MITLM Under Cygwin on Windows

MITLM is a very nice language modelling toolkit. It is fast, easy to use and consistently produces language models that out perform those generated by other language model toolkits. Building under Cygwin with a few additional commands is possible:

svn checkout http://mitlm.googlecode.com/svn/trunk/ mitlm-read-onlycd mitlm-read-onlyln -s /lib/gcc/i686-pc-cygwin/3.4.4/libg2c.a/lib/gcc/i686-pc-cygwin/4.3.4/libg2c.ased 's/LDFLAGS  = -L. -lg2c -lmitlm/LDFLAGS  = -L. -lmitlm -lg2c/' <Makefile.example > Makefilemake -j

## Tuesday, 26 July 2011

### Latex In Blogger

After seeing the Latex support in wordpress.com blogs, found this nice explanation for adding Latex support to blogger.

${a^nb^n| n \geq 0}$

The new blogger interface makes it a pain to edit the HTML template now.

## Wednesday, 16 March 2011

### Setting up OpenFst 1.2.7 with Visual Studio

1 Obtain (and Build) the OpenFst binaries

Download the pre-built 1.2.7 binaries and source from:

If you want to rebuild from source then, open the solution openfstwin.sln in Visual Studio 2010 and build the Debug and/or Release configurations. This will build all the command lines and OpenFst dll for the selected configurations.

2 Set PATH variables

Add the openfstwin Release and/or Debug sub-directories to the system or user PATH variable.  (Control Panel\System and Security\System, Advanced System Settings->Advanced, Environment). This allows the system to find the OpenFst dlls when an executable is run.

3 Create new project

Use the C++ project wizard to create an empty Win32 Console Application and give it name for example fsttest.

Create a Console application and select Empty project

4 Create a Property Page for OpenFst or set paths

Unfortunately Microsoft removed the Global include/lib/bin options from Visual Studio 2010. This means we have to setup the OpenFfst include and lib directories for every project. There are two workarounds to make thing easier for subsequent project . Set the INCLUDE and LIBPATH variables (vis Control Panel\System and Security\System, Advanced System Settings->Advanced, Environment).

An alternative re-usable workaround is to a Property Sheet for the OpenFst directories and re-use it for all the new projects. This technique has the advantage that intellisense will work for the OpenFst library.

Go to  the Property Manager [View->Property Manager], and then right click on any project node and select properties from the context menu.

Add a new property sheet but store it somewhere for use in other project. In this example the property sheet is called openfst.

5 Setup the C++ Directory Paths

Next right click one of openfst.prop nodes and select properties. Go to C++ General and Open [Additional Include Directories]

Add the full path to the OpenFst include to the “Include Directories” and both the Debug and Release folders to the “Library Directory”

This version of OpenFstWin uses the auto-linking support in Visual Studio and therefore it is not necessary to manual set a reference to the library or dll. The debug dll is names openfst-gd.lib/dll andthe release builds are openfst.lib/dll.

Note it is essential to link the correct version (the auto-linking should handle this for you anyway) otherwise the incompatibilities between the iterator debugging for the two builds will cause incomprehensible crashes. In addition do not disable iterator debugging in your client applications debug builds (or enable it in release) without making the necessary changes to the core OpenFst visual builds.

After setting the environment variables everything should happen automatically!

Debugging

The debug build will generate debug information so it will become possible to step into the library cc files. All of the algorithm are implemented as header files. The symbol table and parts of the FST IO routines are in cc files, if you do not need to step into this code you can skip this step.

To add the debug Tools->Options->Debugging->Symbols and add the path to the OpenFst debug.

Will all these setting you can add a C++ code files to your project and build and debug against OpenFst.

## Saturday, 14 August 2010

### Lookahead Composition in Openfst 1.2

Openfst 1.2 was just released. One of the big addition is the lookahead composition filters, more details can be found in [2] and [3]. Get the latest source from here . By default the lookahead filters are not built so run configure with all the options.

configure --enable-compact-fsts --enable-const-fsts --enable-far --enable-lookahead-fsts --enable-pdt

Here is quick example of using the lookahead filters to compose  two speech recognition transducers, a lexicon L and and language model G. Using these filter it is no possible to determinize the lexicon before composition with G. The advantage is the costly determinization of LG is avoided.

These are the steps to perform an offline lookahead compostion. The below assumes a closed lexicon L with auxillary symbols and and n-gram language model G constructed according to [1]. In this case L has 563232 of states and 656819 arcs and G has 997740 states and 4764980 arcs.

First determinize and convert the lexicon to an olabel_lookahead type fst. The save_olabel_pairs options although not shown in the fstconvert help output is valid option for the command.

fstdeterminize l.fst | fstconvert –fst_type=olabel_lookahead  –save_relabel_opairs=relabel.txt > lookahead.l.fst

Next relabel the input of the language model to match the lookahead lexicon. The arcsort command will sort the input of language model. The sorting is essential for efficient composition in the following stage.

fstrelabel -relabel_ipairs=relabeltxt g.fst| fstarcsort > relabel.g.fst

Finally compose the L and G machines togeather. The compose tool can automatically detect that the L transducer is a lookahead type and select the optimized filter.

fstcompose lookahead.l.fst sorted.g.fst > lg.fst

Here is the output of fstinfo for lg.fst

# of states 5863514
# of arcs 9630754

# of final states 1
# of input/output epsilons 1414832
# of input epsilons 1904971
# of output epsilons 5863511
# of accessible states 5863514
# of coaccessible states 5863514
# of connected states 5863514
# of strongly conn components 150128

For a comparison a more standard  det(LG) cascade.

# of states 6294455
# of arcs 10053167
# of final states 1
# of input/output epsilons 1995247
# of input epsilons 1995247
# of output epsilons 6302523
# of accessible states 6294455
# of coaccessible states 6294455
# of connected states 6294455
# of strongly conn components 168471

The difference in size is party due to the epsilon closure in the non-determinized L transducers.

[1]  Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley.
Speech recognition with weighted finite-state transducers.
In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech Communication, Part E: Speech recognition. Springer-Verlag, Heidelberg, Germany, 2008.

[2] Cyril Allauzen, Michael Riley, Johan Schalkwyk, A Generalized Composition Algorithm for Weighted Finite-State Transducers, Interspeech 2009.

[3] Cyril Allauzen, Michael Riley and Johan Schalkwyk, Filters for Efficient Composition of Weighted Finite-State Transducer", Proceedings of the Fifteenth International Conference on Implementation and Application of Automata, (CIAA 2010), Winnipeg, MB.