Weighted finite-state transducers (WFSTs) have revolutionized
automatic speech recognition (ASR) by enabling significantly faster decoding
than traditional systems that build the search space
progressively. However, applying WFSTs to morphology-rich languages
such as Arabic presents challenges due to the large vocabulary, resulting in
extensive networks that exceed the memory available on standard CPU machines. This
study introduces various strategies to reduce the size of large-vocabulary
Arabic WFSTs with minimal impact on accuracy. We employed a star
architecture for the network topology, which effectively reduced the network
size and improved the decoding speed. Additionally, a two-pass decoding
approach was adopted: the first pass used a smaller network with a
short-history language model, and the second pass rescored the resulting lattice
with a longer-history language model. We explored several tuning parameters
to find the optimal balance between network size and accuracy. Our results
show that by using an optimized search graph built with a 2-gram language
model instead of a 3-gram model, we achieve a 45% reduction in the graph’s
memory footprint with a negligible accuracy loss of less than 0.2% in multi-reference word error rate (MR-WER).
On the MGB3 benchmark, our method achieved 40x real-time Arabic ASR with an
accuracy of 83.67%, compared to the 85.82% accuracy of state-of-the-art
systems, which reach only 8x real-time performance on standard CPUs.
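
The two-pass scheme above can be summarized, in notation we introduce purely for
illustration (acoustic score $S_{\text{ac}}$, first-pass and rescoring language
models $P_{\text{short}}$ and $P_{\text{long}}$, language-model scale $\lambda$),
as the standard lattice-rescoring substitution applied to each path $W$ in the
first-pass lattice $\mathcal{L}$; this is a generic sketch rather than the exact
scoring used in this work:

% Generic lattice-rescoring substitution (notation ours, for illustration only):
% S_ac            : acoustic score of path W from the first pass
% P_short, P_long : short- and longer-history language models
% lambda          : language-model scale
\begin{align*}
  S_{\text{1st}}(W) &= S_{\text{ac}}(W) + \lambda \log P_{\text{short}}(W), \\
  S_{\text{2nd}}(W) &= S_{\text{1st}}(W) - \lambda \log P_{\text{short}}(W)
                       + \lambda \log P_{\text{long}}(W), \\
  \hat{W} &= \operatorname*{arg\,max}_{W \in \mathcal{L}} \; S_{\text{2nd}}(W).
\end{align*}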