CLR cover

Problem 150: Text index for patterns with don't care symbols

\( \)

For a string $w$ of size $n$ over a (large) integer alphabet $\Sigma$ we want to create a data structure $\D(w)$, called the text index, which allows to search for a pattern $P$ in $w$ in $O(|P|)$ time (usually $|P|\lt\lt n$). The pattern $P$ can contain a single occurrence of a special symbol $\theta\notin \Sigma$ called a don't care or a wildcard, which matches any other symbol in $w$.

The solution is tree-oriented. We consider trees with each internal nodes having at least two children. For each internal node $v$ denote by $T_v$ the subtree rooted at $v$. An edge $v\rightarrow u$, where $v$ is the parent of $u$ is called heavy if $T_u$ contains largest number of leaves among subtrees rooted at children of $v$ (in case of ties we choose a single edge). Other edges are called light. If $v\rightarrow u$ is heavy then subtrees rooted at other children of $v$ are called light subtrees. The set of these trees is denoted by $\Off(v)$.

Observe that each path from a given leaf to the root contains only logarithmically many light edges, hence each leaf belongs to logarithmically many light subtrees. Consequently we have the following fact.

The sum of sizes of all subtrees in $\Off(v)$, over all $v$, is $O(|T|\log |T|)$.

Construct the text index $\D(w)$ of size $O(n \log n)$ with searching time $O(|P|)$.

Use the observation.

References