|
Problem 71: Factor oracle
|
|
The Factor oracle is an indexing structure similar to the Factor or Suffix
automaton (or DAWG) of a word \(x\).
It is a deterministic automaton with \(|x|+1\) states, the minimum
number of states the Suffix automaton of \(x\) can
have.
This makes it a well-suited data structure in many applications that
require a simple indexing structure and leads both to a space-economical data
structure and to an efficient online construction. The drawback is that the
oracle of \(x\) accepts slightly more words than the factors of
\(x\).
For a factor \(v\) of \(y\), let \(pocc(v,y)\) be
the position on \(y\) following the first occurrence of \(v\)
in \(y\), that is, \(pocc(v,y) = \min \{|z| \mid z = wv \mbox{ prefix of } y\}\).
The following algorithm may be viewed as a definition of the Factor oracle
\({\cal O}(x)\) of a word \(x\).
It computes the automaton in which \(Q\) is the set of states and
\(E\) the set of labelled edges.
Oracle\((x \mbox{ non-empty word})\)
\begin{algorithmic}
\STATE $(Q,E)\leftarrow (\{0,1,\dots,|x|\},\emptyset)$
\FOR{$i \leftarrow 0$ \TO $|x| - 1$}
\STATE $u \leftarrow $ shortest word recognised in state $i$
\ENDFOR
\FOR{$a\in A$}
\IF{ $ua \in Fact(x[i-|u|..|x|-1])$}
\STATE $E \leftarrow E \cup\{(i,a,pocc(ua,x[i-|u|..|x|-1]))\}$
\ENDIF
\ENDFOR
\RETURN $(Q,E)$
\end{algorithmic}
Actually the structure has several interesting properties.
Its \(|x|+1\) states are all terminal states.
Every edge whose target is \(i+1\) is labelled by \(x[i]\).
There are \(|x|\) edges of the form \((i,x[i],i+1)\), called internal edges.
Other edges, of the form \((j,x[i],i+1)\) with \(j\le i\), are called
external edges.
The oracle can thus be represented by \(x\) and its set of external edges
without their labels.
Show that the Factor oracle of a word $x$ has between $|x|$ and $2|x|-1$
edges.
Design an online construction of the Factor oracle of a word $x$ running in
linear time on a fixed alphabet with linear space.
Use suffix links.
Show the Factor oracle ${\cal O}(x)$ can be used for locating all the
occurrences of $x$ in a text, despite the oracle may accept words that are
not factors of $x$.
The only word of length $|x|$ recognised by ${\cal O}(x)$ is $x$ itself.
References
C. Allauzen, M. Crochemore, and M. Raffinot. Factor oracle: A new
structure for pattern matching. In J. Pavelka, G. Tel, and M. Bartosek,
editors, SOFSEM'99, Theory and Practice of Informatics, 26th Conference
on Current Trends in Theory and Practice of Informatics, Milovy,
Czech Republic, November 27 - December 4, 1999, Proceedings, volume
1725 of Lecture Notes in Computer Science, pages 295-310. Springer,
1999.
G. Assayag and S. Dubnov. Using factor oracles for machine improvisation.
Soft Comput., 8(9):604-610, 2004.
J. Bourdon and I. Rusu. Statistical properties of factor oracles. J.
Discrete Algorithms, 9(1):57-66, 2011.
M. Crochemore, L. Ilie, and E. Seid-Hilmi. The structure of factor oracles.
Int. J. Found. Comput. Sci., 18(4):781-797, 2007.
H. Fan, N. Yao, and H. Ma. Fast variants of the Backward-Oracle-
Marching algorithm. In ICICSE'09, Fourth International Conference
on Internet Computing for Science and Engineering, pages 56-59. IEEE
Computer Society, 2009.
S. Faro and T. Lecroq. Efficient variants of the Backward-Oracle-
Matching algorithm. Int. J. Found. Comput. Sci., 20(6):967-984, 2009.
A. Lefebvre and T. Lecroq. Compror: On-line lossless data compression
with a factor oracle. Inf. Process. Lett., 83(1):1-6, 2002.
A. Mancheron and C. Moan. Combinatorial characterization of the language
recognized by factor and suffix oracles. Int. J. Found. Comput.
Sci., 16(6):1179-1191, 2005.