CLR cover
Previous problem

Problem 150: Text index for patterns with don't care symbols

\( \)

For a string $w$ of size $n$ over a (large) integer alphabet $\Sigma$ we want to create a data structure $\D(w)$, of size $O(n\log n)$, called the text index, which allows to search for a pattern $P$ in $w$ in $O(|P|)$ time (usually $|P|\lt\lt n$). The pattern $P$ can contain a single occurrence of a special symbol $\theta\notin \Sigma$ called a don't care or a wildcard, which matches any other symbol in $w$. In this simplified problem we do not ask about time complexity of constructing $\D(w)$, since it is quite technical (see the notes). Our main aim here is only a small size of $\D(w)$ and fast searching of the pattern.

Combinatorics of trees

We consider only trees with each internal nodes having at least two children. By a size of a tree we mean the number of its leaves. For each internal node $v$ denote by $T_v$ the subtree rooted at $v$. An edge $v\rightarrow u$, where $v$ is the parent of $u$ is called heavy if $T_u$ has largest size among subtrees rooted at children of $v$ (in case of ties we choose a single edge). Other edges are called light. If $v\rightarrow u$ is heavy then subtrees rooted at other children of $v$ are called light subtrees.

Observe that each path from a given leaf to the root contains only logarithmically many light edges, hence each leaf belongs to logarithmically many light subtrees. Consequently we have the following fact.

The sum of sizes of all light subtrees in $\Off(v)$ is $O(|T|\log |T|)$.

Construct the text index $\D(w)$ of size $O(n \log n)$ with searching time $O(|P|)$.

Use the observation.

References