CLR cover

Problem 71: Factor oracle

The Factor oracle is an indexing structure similar to the Factor or Suffix automaton (or DAWG) of a word \(x\). It is a deterministic automaton with \(|x|+1\) states, the minimum number of states the Suffix automaton of \(x\) can have. This makes it a well-suited data structure in many applications that require a simple indexing structure and leads both to a space-economical data structure and to an efficient online construction. The drawback is that the oracle of \(x\) accepts slightly more words than the factors of \(x\).

For a factor \(v\) of \(y\), let \(pocc(v,y)\) be the position on \(y\) following the first occurrence of \(v\) in \(y\), that is, \(pocc(v,y) = \min \{|z| \mid z = wv \mbox{ prefix of } y\}\). The following algorithm may be viewed as a definition of the Factor oracle \({\cal O}(x)\) of a word \(x\). It computes the automaton in which \(Q\) is the set of states and \(E\) the set of labelled edges.

Oracle\((x \mbox{ non-empty word})\)
    \begin{algorithmic}
  \STATE $(Q,E)\leftarrow (\{0,1,\dots,|x|\},\emptyset)$
  \FOR{$i \leftarrow 0$ \TO $|x| - 1$}
     \STATE $u \leftarrow $ shortest word recognised in state $i$
  \ENDFOR
     \FOR{$a\in A$}
       \IF{ $ua \in Fact(x[i-|u|..|x|-1])$}
         \STATE $E \leftarrow E \cup\{(i,a,pocc(ua,x[i-|u|..|x|-1]))\}$
       \ENDIF
     \ENDFOR
  \RETURN $(Q,E)$
    \end{algorithmic}

Actually the structure has several interesting properties. Its \(|x|+1\) states are all terminal states. Every edge whose target is \(i+1\) is labelled by \(x[i]\). There are \(|x|\) edges of the form \((i,x[i],i+1)\), called internal edges. Other edges, of the form \((j,x[i],i+1)\) with \(j\le i\), are called external edges. The oracle can thus be represented by \(x\) and its set of external edges without their labels.

Show that the Factor oracle of a word $x$ has between $|x|$ and $2|x|-1$ edges.

Design an online construction of the Factor oracle of a word $x$ running in linear time on a fixed alphabet with linear space.
Use suffix links.

Show the Factor oracle ${\cal O}(x)$ can be used for locating all the occurrences of $x$ in a text, despite the oracle may accept words that are not factors of $x$.
The only word of length $|x|$ recognised by ${\cal O}(x)$ is $x$ itself.

References

  • C. Allauzen, M. Crochemore, and M. Raffinot. Factor oracle: A new structure for pattern matching. In J. Pavelka, G. Tel, and M. Bartosek, editors, SOFSEM'99, Theory and Practice of Informatics, 26th Conference on Current Trends in Theory and Practice of Informatics, Milovy, Czech Republic, November 27 - December 4, 1999, Proceedings, volume 1725 of Lecture Notes in Computer Science, pages 295-310. Springer, 1999.
  • G. Assayag and S. Dubnov. Using factor oracles for machine improvisation. Soft Comput., 8(9):604-610, 2004.
  • J. Bourdon and I. Rusu. Statistical properties of factor oracles. J. Discrete Algorithms, 9(1):57-66, 2011.
  • M. Crochemore, L. Ilie, and E. Seid-Hilmi. The structure of factor oracles. Int. J. Found. Comput. Sci., 18(4):781-797, 2007.
  • H. Fan, N. Yao, and H. Ma. Fast variants of the Backward-Oracle- Marching algorithm. In ICICSE'09, Fourth International Conference on Internet Computing for Science and Engineering, pages 56-59. IEEE Computer Society, 2009.
  • S. Faro and T. Lecroq. Efficient variants of the Backward-Oracle- Matching algorithm. Int. J. Found. Comput. Sci., 20(6):967-984, 2009.
  • A. Lefebvre and T. Lecroq. Compror: On-line lossless data compression with a factor oracle. Inf. Process. Lett., 83(1):1-6, 2002.
  • A. Mancheron and C. Moan. Combinatorial characterization of the language recognized by factor and suffix oracles. Int. J. Found. Comput. Sci., 16(6):1179-1191, 2005.