|
Problem 150: Text index for patterns with don't care symbols
|
\(
\)
For a string $w$ of size $n$ over a (large) integer alphabet
$\Sigma$ we want to
create a data structure $\D(w)$, of size $O(n\log n)$, called the text index, which allows to
search for a pattern $P$ in $w$ in $O(|P|)$ time
(usually $|P|\lt\lt n$).
The pattern $P$ can contain
a single occurrence of a special symbol $\theta\notin \Sigma$ called a don't care
or a wildcard, which matches any other symbol in $w$.
In this simplified problem we do not ask about time complexity of constructing $\D(w)$,
since it is quite technical (see the notes).
Our main aim here is only a small size of $\D(w)$ and fast searching of the pattern.
Combinatorics of trees
We consider only trees with each internal nodes having at least two children.
By a size of a tree we mean the number of its leaves.
For each internal node $v$ denote by $T_v$ the subtree rooted at $v$.
An edge $v\rightarrow u$, where $v$ is the
parent of $u$ is called
heavy if $T_u$ has largest size among subtrees rooted
at children of $v$ (in case of ties we choose a single edge).
Other edges are called
light.
If $v\rightarrow u$ is heavy then subtrees rooted at other children of $v$ are
called
light subtrees.
Observe that each path from a given leaf to the root contains only
logarithmically many light edges,
hence each leaf belongs to logarithmically many light subtrees.
Consequently we have the following fact.
The sum of sizes of all light subtrees in $\Off(v)$
is $O(|T|\log |T|)$.
Construct the text index $\D(w)$ of size $O(n \log n)$ with searching time $O(|P|)$.
Use the observation.
References
-
R. Cole, L. Gottlieb, and M. Lewenstein,
Dictionary matching and indexing with errors and don't cares,
In L. Babai, editor, Proceedings of
the 36th Annual ACM Symposium on Theory of Computing, Chicago,
IL, USA, pages 91-100. ACM, 2004.