Text Joins in an RDBMS for Web Data Integration

Luis Gravano

Columbia University

Panagiotis G. Ipeirotis

Columbia University

Nick Koudas

AT&T Labs-Research

Divesh Srivastava

AT&T Labs-Research

Abstract

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.

Keywords

indexing, data mining

Introduction

The integration of information from heterogeneous web sources is of central interest for applications such as catalog data integration and warehousing of web data (e.g., job advertisements and announcements). Such data is typically textual and can be obtained from disparate web sources in a variety of ways, including web site crawling and direct access to remote databases via web protocols. The integration of such web data exhibits many semantics- and performance-related challenges.

Consider a price-comparison web site, backed by a database, that combines product information from different vendor web sites and presents the results under a uniform interface to the user. In such a situation, one cannot assume the existence of global identifiers (i.e., unique keys) for products across the autonomous vendor web sites. This raises a fundamental problem: different vendors may use different names to describe the same product. For example, a vendor might list a hard disk as ``Western Digital 120Gb 7200 rpm,'' while another might refer to the same disk as ``Western Digiral HDD 120Gb'' (due to a spelling mistake) or even as ``WD 120Gb 7200rpm'' (using an abbreviation). A simple equality comparison on product names will not properly identify these descriptions as referring to the same entity. This could result in the same product entity from different vendors being treated as separate products, defeating the purpose of the price-comparison web site. To effectively address the integration problem, one needs to match multiple textual descriptions, accounting for:


or combinations thereof.

Any attempt to address the integration problem has to specify a measure that effectively quantifies ``closeness'' or ``similarity'' between string attributes. Such a similarity metric can help establish that ``Microsoft Windows XP Professional'' and ``Windows XP Pro'' correspond to the same product across the web sites/databases, and that these are different from the ``Windows NT'' product. Many approaches to data integration use a text matching step, where similar textual entries are matched together as potential duplicates. Although text matching is an important component of such systems [1,21,23], little emphasis has been paid on the efficiency of this operation.

Once a text similarity metric is specified, there is a clear requirement for algorithms that process the data from the multiple sources to identify all pairs of strings (or sets of strings) that are sufficiently similar to each other. We refer to this operation as a text join. To perform such a text join on data originating at different web sites, we can utilize ``web services'' to fully download and materialize the data at a local relational database management system (RDBMS). Once this materialization has been performed, problems and inconsistencies can be handled locally via text join operations. It is desirable for scalability and effectiveness to fully utilize the RDBMS capabilities to execute such operations.

In this paper, we present techniques for performing text joins efficiently and robustly in an unmodified RDBMS. Our text joins rely on the cosine similarity metric [20], which has been successfully used in the past in the WHIRL system [4] for a similar data integration task. Our contributions include:


The remainder of this paper is organized as follows. Section 2 presents background and notation necessary for the rest of the discussion, and introduces a formal statement of our problem. Section 3 presents SQL statements to preprocess relational tables so that we can apply the sampling-based text join algorithm of Section 4. Then, Section 5 presents the implementation of the text join algorithm in SQL. A preliminary version of Sections 3 and 5 appears in [12]. Section 6 reports a detailed experimental evaluation of our techniques in terms of both accuracy and performance, and in comparison with other applicable approaches. Section 7 discusses the relative merits of alternative string similarity metrics. Section 8 reviews related work. Finally, Section 9 concludes the paper and discusses possible extensions of our work.


Background and Problem

In this section, we first provide notation and background for text joins, which we follow with a formal definition of the problem on which we focus in this paper.

We denote with $\Sigma^{*}$ the set of all strings over an alphabet $\Sigma$. Each string in $\Sigma^{*}$ can be decomposed into a collection of atomic ``entities'' that we generally refer to as tokens. What constitutes a token can be defined in a variety of ways. For example, the tokens of a string could simply be defined as the ``words'' delimited by special characters that are treated as ``separators'' (e.g., ` '). Alternatively, the tokens of a string could correspond to all of its $q$-grams, which are overlapping substrings of exactly $q$ consecutive characters, for a given $q$. Our forthcoming discussion treats the term token as generic, as the particular choice of token is orthogonal to the design of our algorithms. Later, in Section 6 we experiment with different token definitions, while in Section 7 we discuss the effect of token choice on the characteristics of the resulting similarity function.

Let $R_1$ and $R_2$ be two relations with the same or different attributes and schemas. To simplify our discussion and notation we assume, without loss of generality, that we assess similarity between the entire sets of attributes of $R_1$ and $R_2$. Our discussion extends to the case of arbitrary subsets of attributes in a straightforward way. Given tuples $t_1 \in R_1$ and $t_2 \in R_2$, we assume that the values of their attributes are drawn from $\Sigma^{*}$. We adopt the widely used vector-space retrieval model [20] from the information retrieval field to define the textual similarity between $t_1$ and $t_2$.

Let $D$ be the (arbitrarily ordered) set of all unique tokens present in all values of attributes of both $R_1$ and $R_2$. According to the vector-space retrieval model, we conceptually map each tuple $t \in R_i$ to a vector $v_t \in \Re^{\vert D\vert}$. The value of the $j$-th component $v_t(j)$ of $v_t$ is a real number that corresponds to the weight of the $j$-th token of $D$ in $v_t$. Drawing an analogy with information retrieval terminology, $D$ is the set of all terms and $v_t$ is a document weight vector.

Rather than developing new ways to define the weight vector $v_t$ for a tuple $t \in R_i$, we exploit an instance of the well-established tf.idf weighting scheme from the information retrieval field. (tf.idf stands for ``term frequency, inverse document frequency.'') Our choice is further supported by the fact that a variant of this general weighting scheme has been successfully used for our task by Cohen's WHIRL system [4]. Given a collection of documents $C$, a simple version of the tf.idf weight for a term $w$ and a document $d$ is defined as $\mathit{tf_w} \log(idf_w)$, where $\mathit{tf_w}$ is the number of times that $w$ appears in document $d$ and $idf_w$ is $\frac{\vert C\vert}{n_w}$, where $n_w$ is the number of documents in the collection $C$ that contain term $w$. The tf.idf weight for a term $w$ in a document is high if $w$ appears a large number of times in the document and $w$ is a sufficiently ``rare'' term in the collection (i.e., if $w$'s discriminatory power in the collection is potentially high). For example, for a collection of company names, relatively infrequent terms such as ``AT&T'' or ``IBM'' will have higher idf weights than more frequent terms such as ``Inc.''

For our problem, the relation tuples are our ``documents,'' and the tokens in the textual attribute of the tuples are our ``terms.'' Consider the $j$-th token $w$ in $D$ and a tuple $t$ from relation $R_i$. Then $\mathit{tf_w}$ is the number of times that $w$ appears in $t$. Also, $idf_w$ is $\frac{\vert R_i\vert}{n_w}$, where $n_w$ is the total number of tuples in relation $R_i$ that contain token $w$. The tf.idf weight for token $w$ in tuple $t \in R_i$ is $v_t(j)=\mathit{tf_w}\log(idf_w)$. To simplify the computation of vector similarities, we normalize vector $v_t$ to unit length in the Euclidean space after we define it. The resulting weights correspond to the impact of the terms, as defined in [24]. Note that the weight vectors will tend to be extremely sparse for certain choices of tokens; we shall seek to utilize this sparseness in our proposed techniques.

Definition 1   Given tuples $t_1 \in R_1$ and $t_2 \in R_2$, let $v_{t_1}$ and $v_{t_2}$ be their corresponding normalized weight vectors and let $D$ be the set of all tokens in $R_1$ and $R_2$. The cosine similarity (or just similarity, for brevity) of $v_{t_1}$ and $v_{t_2}$ is defined as $sim(v_{t_1}, v_{t_2}) = \sum_{j=1}^{\vert D\vert} v_{t_1}(j) v_{t_2}(j)$ .

Since vectors are normalized, this measure corresponds to the cosine of the angle between vectors $v_{t_1}$ and $v_{t_2}$, and has values between 0 and 1. The intuition behind this scheme is that the magnitude of a component of a vector expresses the relative ``importance'' of the corresponding token in the tuple represented by the vector. Intuitively, two vectors are similar if they share many important tokens. For example, the string ``ACME'' will be highly similar to ``ACME Inc,'' since the two strings differ only on the token ``Inc,'' which appears in many different tuples, and hence has low weight. On the other hand, the strings ``IBM Research'' and ``AT&T Research'' will have lower similarity as they share only one relatively common term.

The following join between relations $R_1$ and $R_2$ brings together the tuples from these relations that are ``sufficiently close'' to each other, according to a user-specified similarity threshold $\phi $:

Definition 2   Given two relations $R_1$ and $R_2$, together with a similarity threshold $0 < \phi \leq 1$, the text join $R_1 \widetilde{\Join}_{\phi}R_2$ returns all pairs of tuples $(t_1, t_2)$ such that $t_1 \in R_1$ and $t_2 \in R_2$, and $sim(v_{t_1}, v_{t_2})\geq \phi$.

The text join ``correlates'' two relations for a given similarity threshold $\phi $. It can be easily modified to correlate arbitrary subsets of attributes of the relations. In this paper, we address the problem of computing the text join of two relations efficiently and within an unmodified RDBMS:

Problem 1   Given two relations $R_1$ and $R_2$, together with a similarity threshold $0 < \phi \leq 1$, we want to efficiently compute (an approximation of) the text join $R_1 \widetilde{\Join}_{\phi}R_2$ using ``vanilla'' SQL in an unmodified RDBMS.

In the sequel, we first describe our methodology for deriving, in a preprocessing step, the vectors corresponding to each tuple of relations $R_1$ and $R_2$ using relational operations and representations. We then present a sampling-based solution for efficiently computing the text join of the two relations using standard SQL in an RDBMS.


Tuple Weight Vectors

In this section, we describe how we define auxiliary relations to represent tuple weight vectors, which we later use in our purely-SQL text join approximation strategy.

As in Section 2, assume that we want to compute the text join $R_1 \widetilde{\Join}_{\phi}R_2$ of two relations $R_1$ and $R_2$. $D$ is the ordered set of all the tokens that appear in $R_1$ and $R_2$. We use SQL expressions to create the weight vector associated with each tuple in the two relations. Since -for some choice of tokens- each tuple is expected to contain only a few of the tokens in $D$, the associated weight vector is sparse. We exploit this sparseness and represent the weight vectors by storing only the tokens with non-zero weight. Specifically, for a choice of tokens (e.g., words or $q$-grams), we create the following relations for a relation $R_i$:

Figure 1: Preprocessing SQL statements to create auxiliary relations for relation $R_i$.

\fbox{ \begin{tabular}{ll} \begin{minipage}[t]{3in} \begin{tabbing} {\tt INS... ...oken weights & (f) Dummy relation used to create \emph{RiIDF} \end{tabular} }

Given two relations $R_1$ and $R_2$, we can use the SQL statements in Figure 1 to generate relations R1Weights and R2Weights with a compact representation of the weight vector for the $R_1$ and $R_2$ tuples. Only the non-zero tf.idf weights are stored in these tables. Interestingly, RiWeights and RiSum are the only tables that need to be preserved for the computation of $R_1 \widetilde{\Join}_{\phi}R_2$ that we describe in the remainder of the paper: all other tables are just necessary to construct RiWeights and RiSum. The space overhead introduced by these tables is moderate. Since the size of RiSum is bounded by the size of RiWeights, we just analyze the space requirements for RiWeights.

Consider the case where $q$-grams are the tokens of choice. (As we will see, a good value is $q=3$.) Then each tuple $R_i.t_j$ of relation $R_i$ can contribute up to approximately $\vert R_i.t_j\vert$ $q$-grams to relation RiWeights, where $\vert R_i.t_j\vert$ is the number of characters in $R_i.t_j$. Furthermore, each tuple in RiWeights consists of a tuple id tid, the actual token (i.e., $q$-gram in this case), and its associated weight. Then, if $C$ bytes are needed to represent tid and weight, the total size of relation RiWeights will not exceed $\sum_{j=1}^{\vert R_i\vert} (C+q)\cdot \vert R_i.t_j\vert = (C+q)\cdot \sum_{j=1}^{\vert R_i\vert} \vert R_i.t_j\vert$ , which is a (small) constant times the size of the original table $R_i$. If words are used as the token of choice, then we have at most $\frac{\vert R_i.t_j\vert}{2}$ tokens per tuple in $R_i$. Also, to store the token attribute of RiWeights we need no more than one byte for each character in the $R_i.t_j$ tuples. Therefore, we can bound the size of RiWeights by $1+\frac{C}{2}$ times the size of $R_i$. Again, in this case the space overhead is linear in the size of the original relation $R_i$.

Given the relations R1Weights and R2Weights, a baseline approach [13,18] to compute $R_1 \widetilde{\Join}_{\phi}R_2$ is shown in Figure 2. This SQL statement performs the text join by computing the similarity of each pair of tuples and filtering out any pair with similarity less than the similarity threshold $\phi $. This approach produces an exact answer to $R_1 \widetilde{\Join}_{\phi}R_2$ for $\phi>0$. Unfortunately, as we will see in Section 6, finding an exact answer with this approach is prohibitively expensive, which motivates the sampling-based technique that we describe next.

Figure 2: Baseline approach for computing the exact value of $R_1 \widetilde{\Join}_{\phi}R_2$.

\fbox{ \begin{minipage}{2.7in} \begin{tabbing} {\tt SELECT    r1w.tid AS tid1... ...VING    SUM(r1w.weight*r2w.weight)$\geq \phi$} \end{tabbing} \end{minipage} }


Sampling-Based Text Joins

The result of $R_1 \widetilde{\Join}_{\phi}R_2$ only contains pairs of tuples from $R_1$ and $R_2$ with similarity $\phi $ or higher. Usually we are interested in high values for threshold $\phi $, which should result in only a few tuples from $R_2$ typically matching each tuple from $R_1$. The baseline approach in Figure 2, however, calculates the similarity of all pairs of tuples from $R_1$ and $R_2$ that share at least one token. As a result, this baseline approach is inefficient: most of the candidate tuple pairs that it considers do not make it to the final result of the text join. In this section, we describe a sampling-based technique [2] to execute text joins efficiently, drastically reducing the number of candidate tuple pairs that are considered during query processing.

The sampling-based technique relies on the following intuition: $R_1 \widetilde{\Join}_{\phi}R_2$ could be computed efficiently if, for each tuple $t_q$ of $R_1$, we managed to extract a sample from $R_2$ containing mostly tuples suspected to be highly similar to $t_q$. By ignoring the remaining (useless) tuples in $R_2$, we could approximate $R_1 \widetilde{\Join}_{\phi}R_2$ efficiently. The key challenge then is how to define a sampling strategy that leads to efficient text join executions while producing an accurate approximation of the exact query results. The discussion of the technique is organized as follows:


The sampling algorithm described in this section is an instance of the approximate matrix multiplication algorithm presented in [2], which computes an approximation of the product $A=A_1 \cdot \ldots \cdot A_n$, where each $A_i$ is a numeric matrix. (In our problem, $n=2$.) The actual matrix multiplication $A'=A_2 \cdot \ldots \cdot A_n$ happens during a preprocessing, off-line step. Then, the on-line part of the algorithm works by processing the matrix $A_1$ row by row.

Token-Weighted Sampling

Consider tuple $t_q \in R_1$ with its associated token weight vector $v_{t_q}$, and each tuple $t_i \in R_2$ with its associated token weight vector $v_{t_i}$. When $t_q$ is clear from the context, to simplify the notation we use $\sigma_i$ as shorthand for $sim(v_{t_q},v_{t_i})$. We extract a sample of $R_2$ tuples of size $S$ for $t_q$ as follows:

Let $C_i$ be the number of times that $t_i$ appears in the sample of size $S$. It follows that:

Theorem 1     The expected value of $\frac{C_i}{S} \cdot T_V(t_q)$ is $\sigma_i$.$\Box$

The proof of this theorem follows from an argument similar to that in [2] and from the observation that the mean of the process that generates $C_i$ is $\frac{\sum_{j=1}^{\vert D\vert} v_{t_q}(j)v_{t_i}(j)}{T_V(t_q)} = \frac{\sigma_i}{T_V(t_q)}$ .

Theorem 1 establishes that, given a tuple $t_q \in R_1$, we can obtain a sample of size $S$ of tuples $t_i$ such that the frequency $C_i$ of tuple $t_i$ can be used to approximate $\sigma_i$. We can then report $\langle t_q, t_i \rangle$ as part of the answer of $R_1 \widetilde{\Join}_{\phi}R_2$ for each tuple $t_i \in R_2$ such that its estimated similarity with $t_q$ (i.e., its estimated $\sigma_i$) is $\phi'$ or larger, where $\phi' = (1-\epsilon)\phi$ is a threshold slightly lower1 than $\phi $.

Given $R_1$, $R_2$, and a threshold $\phi $, our discussion suggests the following strategy for the evaluation of the $R_1 \widetilde{\Join}_{\phi}R_2$ text join, in which we process one tuple $t_q \in R_1$ at a time:

This strategy guarantees that we can identify all pairs of tuples with similarity of at least $\phi $, with a desired probability, as long as we choose an appropriate sample size $S$. So far, the discussion has focused on obtaining an $R_2$ sample of size $S$ individually for each tuple $t_q \in R_1$. A naive implementation of this sampling strategy would then require a scan of relation $R_2$ for each tuple in $R_1$, which is clearly unacceptable in terms of performance. In the next section we describe how the sampling can be performed with only one sequential scan of relation $R_2$.


Practical Realization of Sampling

As discussed so far, the sampling strategy requires extracting a separate sample from $R_2$ for each tuple in $R_1$. This extraction of a potentially large set of independent samples from $R_2$ (i.e., one per $R_1$ tuple) is of course inefficient, since it would require a large number of scans of the $R_2$ table. In this section, we describe how to adapt the original sampling strategy so that it requires one single sample of $R_2$, following the ``presampling'' implementation in [2]. We then show how to use this sample to create an approximate answer for the text join $R_1 \widetilde{\Join}_{\phi}R_2$.

As we have seen in the previous section, for each tuple $t_q \in R_1$ we should sample a tuple $t_i$ from $R_2$ in a way that depends on the $v_{t_q}(j) \cdot v_{t_i}(j)$ values. Since these values are different for each tuple of $R_1$, a straightforward implementation of this sampling strategy requires multiple samples of relation $R_2$. Here we describe an alternative sampling strategy that requires just one sample of $R_2$: First, we sample $R_2$ using only the $v_{t_i}(j)$ weights from the tuples $t_i$ of $R_2$, to generate a single sample of $R_2$. Then, we use the single sample differently for each tuple $t_q$ of $R_1$. Intuitively, we ``weight'' the tuples in the sample according to the weights $v_{t_q}(j)$ of the $t_q$ tuples of $R_1$. In particular, for a desired sample size $S$ and a target similarity $\phi $, we realize the sampling-based text join $R_1 \widetilde{\Join}_{\phi}R_2$ in three steps:


  1. Sampling: We sample the tuple ids $i$ and the corresponding tokens from the vectors $v_{t_i}$ for each tuple $t_i \in R_2$. We sample each token $j$ from a vector $v_{t_i}$ with probability $\frac{v_{t_i}(j)}{Sum(j)}$. (We define $Sum(j)$ as the total weight of the $j$-th token in relation $R_2$, $Sum(j) = \sum_{i = 1}^{\vert R_2\vert} v_{t_i}(j)$. These weights are kept in relation R2Sum.) We perform $S$ trials, yielding approximately $S$ samples for each token $j$. We insert into $\mathit{R2Sample}$ tuples of the form $\langle i,j \rangle$ as many times as there were successful trials for the pair. Alternatively, we can create tuples of the form $\langle i,j,c \rangle$, where $c$ is the number of successful trials. This results in a compact representation of $\mathit{R2Sample}$, which is preferable in practice.


  2. Weighting: The Sampling step uses only the token weights from $R_2$ for the sampling, ignoring the weights of the tokens in the other relation, $R_1$. The cosine similarity, however, uses the products of the weights from both relations. During the Weighting step we use the token weights in the non-sampled relation to get estimates of the cosine similarity, as follows. For each $\mathit{R2Sample}$ tuple $\langle i,j \rangle$, with $c$ occurrences in the table, we compute the value $v_{t_q}(j) \cdot Sum(j) \cdot c$, which is an approximation of $v_{t_q}(j) \cdot v_{t_i}(j) \cdot S$. We add this value to a running counter that keeps the estimated similarity of the two tuples $t_q$ and $t_i$. The Weighting step thus departs from the strategy in [2], for efficiency reasons, in that we do not use sampling during the join processing.


  3. Thresholding: After the Weighting step, we include the tuple pair $\langle t_q, t_i \rangle$ in the final result only if its estimated similarity is no lower than the user-specified threshold (Section 4.1).


Such a sampling scheme identifies tuples with similarity no less than $\phi $ from $R_2$ for each tuple in $R_1$. By sampling $R_2$ only once, the sample will be correlated. As we verify experimentally in Section 6, this sample correlation has a negligible effect on the quality of the join approximation.

As presented, the join-approximation strategy is asymmetric in the sense that it uses tuples from one relation ($R_1$) to weight samples obtained from the other ($R_2$). The text join problem, as defined, is symmetric and does not distinguish or impose an ordering on the operands (relations). Hence, the execution of the text join $R_1 \widetilde{\Join}_{\phi}R_2$ naturally faces the problem of choosing which relation to sample. For a specific instance of the problem, we can break this asymmetry by executing the approximate join twice. Thus, we first sample from vectors of $R_2$ and use $R_1$ to weight the samples. Then, we sample from vectors of $R_1$ and use $R_2$ to weight the samples. Then, we take the union of these as our final result. We refer to this as a symmetric text join. We will evaluate this technique experimentally in Section 6.

In this section we have described how to approximate the text join $R_1 \widetilde{\Join}_{\phi}R_2$ by using weighted sampling. In the next section, we show how this approximate join can be completely implemented in a standard, unmodified RDBMS.


Sampling and Joining Tuple Vectors in SQL

We now describe our SQL implementation of the sampling-based join algorithm of Section 4.2. Section 5.1 addresses the Sampling step, while Section 5.2 focuses on the Weighting and Thresholding steps for the asymmetric versions of the join. Finally, Section 5.3 discusses the implementation of a symmetric version of the approximate join.


Implementing the Sampling Step in SQL

Figure 3: Creating an auxiliary relation that we sample to create RiSample(tid,token).

\fbox{ \begin{minipage}{2.7in} \begin{tabbing} {\tt SELECT  rw.tid, rw.token,... ...iSum rs}\ {\tt WHERE   rw.token = rs.token} \end{tabbing} \end{minipage} }

Given the $\mathit{RiWeights}$ relations, we now show how to implement the Sampling step of the text join approximation strategy (Section 4.2) in SQL. For a desired sample size $S$ and similarity threshold $\phi $, we create the auxiliary relation shown in Figure 3. As the SQL statement in the figure shows, we join the relations $\mathit{RiWeights}$ and $\mathit{RiSum}$ on the token attribute. The $P$ attribute for a tuple in the result is the probability $\mathit{\frac{RiWeights.weight}{RiSum.total}}$ with which we should pick this tuple (Section 4.2). Conceptually, for each tuple in the output of the query of Figure 3 we need to perform $S$ trials, picking each time the tuple with probability $P$. For each successful trial, we insert the corresponding tuple $\langle tid,token \rangle$ in a relation $\mathit{RiSample(tid,token)}$, preserving duplicates. The $S$ trials can be implemented in various ways. One (expensive) way to do this is as follows: We add ``AND P $\geq$ RAND()'' in the WHERE clause of the Figure 3 query, so that the execution of this query corresponds to one ``trial.'' Then, executing this query $S$ times and taking the union of the all results provides the desired answer. A more efficient alternative, which is what we implemented, is to open a cursor on the result of the query in Figure 3, read one tuple at a time, perform $S$ trials on each tuple, and then write back the result. Finally, a pure-SQL ``simulation'' of the Sampling step deterministically defines that each tuple will result in Round( $S \cdot \mathit{\frac{RiWeights.weight}{RiSum.total}}$) ``successes'' after $S$ trials, on average. This deterministic version of the query is shown in Figure 42. We have implemented and run experiments using the deterministic version, and obtained virtually the same performance as with the cursor-based implementation of sampling over the Figure 3 query. In the rest of the paper -to keep the discussion close to the probabilistic framework- we use the cursor-based approach for the Sampling step.

Figure 4: A deterministic version of the Sampling step, which results in a compact representation of RiSample.

\fbox{ \begin{minipage}{2.7in} \begin{tabbing} {\tt INSERT INTO RiSample(tid,... ...        ROUND($S$ * rw.weight/rs.total, 0)>0} \end{tabbing} \end{minipage} }


Implementing the Weighting and Thresholding Steps in SQL

Figure 5: Implementing the Weighting and Thresholding steps in SQL. This query corresponds to the asymmetric execution of the sampling-based text join, where we sample $R_2$ and weight the sample using $R_1$.

\fbox{ \begin{minipage}{2.5in} \begin{tabbing} {\tt SELECT   r1w.tid AS tid1,... ...sum.total/r1v.Tv) $\!\!\geq\! S*\phi'$/r1v.Tv} \end{tabbing} \end{minipage} }

Section 4.2 described the Weighting and Thresholding steps as two separate steps. In practice, we can combine them into one SQL statement, shown in Figure 5. The Weighting step is implemented by the SUM aggregate in the HAVING clause. We weight each tuple from the sample according to $\frac{R1Weights.weight \cdot R2Sum.total}{R1V.T_V}$, which corresponds to $\frac{v_{t_q}(j)\cdot Sum(j)}{T_V(t_q)}$ (see Section 4.2)3. Then, we can count the number of times that each particular tuple pair appears in the results (see GROUP BY clause). For each group, the result of the SUM is the number of times $C_i$ that a specific tuple pair appears in the candidate set. To implement the Thresholding step, we apply the count filter as a simple comparison in the HAVING clause: we check whether the frequency of a tuple pair at least matches the count threshold (i.e., $C_i > \frac{S}{T_V(t_q)}\phi^{'}$). The final output of this SQL operation is a set of tuple id pairs with expected similarity of at least $\phi $. The SQL statement in Figure 5 can be further simplified by completely eliminating the join with the $R1V$ relation. The $R1V.T_V$ values are used only in the HAVING clause, to divide both parts of the inequality. The result of the inequality is not affected by this division, hence the $R1V$ relation can be eliminated when combining the Weighting and the Thresholding step into one SQL statement.


Implementing a Symmetric Text Join Approximation in SQL

Figure 6: A symmetric sampling-based text join $R_1 \widetilde{\Join}_{\phi}R_2$.

\fbox{ \begin{scriptsize} \begin{minipage}{2.7in} \begin{tabbing} {\tt SELEC... ...G AVG(Ci) $\geq$ $S*\phi'$} \end{tabbing} \end{minipage} \end{scriptsize} }

Up to now we have described only an asymmetric text join approximation approach, in which we sample relation $R_2$ and weight the samples according to the tuples in $R_1$ (or vice versa). However, as we described in Section 4.2, the text join $R_1 \widetilde{\Join}_{\phi}R_2$ treats $R_1$ and $R_2$ symmetrically. To break the asymmetry of our sampling-based strategy, we execute the two different asymmetric approximations and report the union of their results, as shown in Figure 6. Note that a tuple pair $\langle tid1, tid2 \rangle$ that appears in the result of the two intervening asymmetric approximations needs high combined ``support'' to qualify in the final answer (see HAVING clause in Figure 64).

An additional strategy naturally suggests itself: Instead of executing the symmetric join algorithm by joining the samples with the original relations, we can just join the samples, ignoring the original relations. We sample each relation independently, join the samples, and then weight and threshold the output. We implement the Weighting step by weighting each tuple with $\frac{R1Sum.total}{R1V.T_V}\cdot \frac{R2Sum.total}{R2V.T_V}$ . The count threshold in this case becomes $C_i > \frac{S\cdot S}{T_V(t_q)\cdot T_V(t_i)}\phi^{'}$ (again the $T_V$ values can be eliminated from the SQL implementation if we combine the Weighting and the Thresholding steps). Figure 7 shows the SQL implementation of this version of the sampling-based text join.

Figure 7: A symmetric sampling-based text join $R_1 \widetilde{\Join}_{\phi}R_2$ involving only the relation samples.

\fbox{ \begin{minipage}{2.5in} \begin{tabbing} {\tt SELECT   r1s.tid AS tid1,... ...M(r1sum.total * r2sum.total) $\geq S*S*\phi'$} \end{tabbing} \end{minipage} }


Experimental Evaluation

We implemented the proposed SQL-based techniques and performed a thorough experimental evaluation in terms of both accuracy and performance in a commercial RDBMS. In Section 6.1, we describe the techniques that we compare and the data sets and metrics that we use for our experiments. Then, we report experimental results in Section 6.2.


Experimental Settings

We implemented the schema and the relations described in Section 3 in a commercial RDMBS, Microsoft SQL Server 2000, running on a multiprocessor machine with 2x2Ghz Xeon CPUs and with 2Gb of RAM. SQL Server was configured to potentially utilize the entire RAM as a buffer pool. We also compared our SQL solution against WHIRL, an alternative stand-alone technique, not available under Windows, using a PC with 2Gb of RAM, 2x1.8Ghz AMD Athlon CPUs and running Linux.

Data Sets: For our experiments, we used real data from an AT&T customer relationship database. We extracted from this database a random sample of 40,000 distinct attribute values of type string. We then split this sample into two data sets, $R_1$ and $R_2$. Data set $R_1$ contains about 14,000 strings, while data set $R_2$ contains about 26,000 strings. The average string length for $R_1$ is 19 characters and, on average, each string consists of 2.5 words. The average string length for $R_2$ is 21 characters and, on average, each string consists of 2.5 words. The length of the strings follows a close-to-Gaussian distribution for both data sets and is reported in Figure 8(a), while the size of $R_1 \widetilde{\Join}_{\phi}R_2$ for different similarity thresholds $\phi $ and token choices is reported in Figure 8(b). We briefly discuss experiments over other data sets later in this section.

Figure 8: Characteristics of the two data sets $R_1$ and $R_2$.
  \includegraphics[width=0.9\columnwidth]{figs/lengths.eps}  
  (a)
String lengths in data sets $R_1$ and $R_2$.
 
  \includegraphics[width=0.9\columnwidth]{figs/distances.eps}  
  (b)
The size of $R_1 \widetilde{\Join}_{\phi}R_2$ for different similarity thresholds and token choices.
 


Metrics: To evaluate the accuracy and completeness of our techniques we use the standard precision and recall metrics:

Definition 3   Consider two relations $R_1$ and $R_2$ and a user-specified similarity threshold $\phi $. Let $\mathit{Answer}_{\phi}$ be an approximate answer for text join $R_1 \widetilde{\Join}_{\phi}R_2$. Then, the precision and recall of $\mathit{Answer}_{\phi}$ with respect to $R_1 \widetilde{\Join}_{\phi}R_2$ are defined as:

\begin{eqnarray}\html{eqn0} \mathit{precision} & = &\frac{\vert \mathit{Answer}... ...t}{\vert\mathit{R_1 \widetilde{\Join}_{\phi}R_2\vert}} \nonumber \end{eqnarray}


Precision and recall can take values in the 0-to-1 range. Precision measures the accuracy of the answer and indicates the fraction of tuples in the approximation of $R_1 \widetilde{\Join}_{\phi}R_2$ that are correct. In contrast, recall measures the completeness of the answer and indicates the fraction of the $R_1 \widetilde{\Join}_{\phi}R_2$ tuples that are captured in the approximation. We believe that recall is more important than precision. The returned answer can always be checked for false positives in a post-join step, while we cannot locate false negatives without re-running the text join algorithm.

Finally, to measure the efficiency of the algorithms, we measure the actual execution time of the text join for different techniques.

Choice of Tokens: We present experiments for different choices of tokens for the similarity computation. (Section 7 discusses the effect of the token choice on the resulting similarity function.) The token types that we consider in our experiments are:


The R1Weights table has 30,933 rows for Words, 268,458 rows for Q-grams with $q=3$, and 245,739 rows for Q-grams with $q=2$. For the R2Weights table, the corresponding numbers of rows are 61,715, 536,982, and 491,515. In Figure 8(b) we show the number of tuple pairs in the exact result of the text join $R_1 \widetilde{\Join}_{\phi}R_2$, for the different token choices and for different similarity thresholds $\phi $.

Techniques Compared: We compare the following algorithms for computing (an approximation of) $R_1 \widetilde{\Join}_{\phi}R_2$. All of these algorithms can be deployed completely within an RDBMS:


In addition, we also compare the SQL-based techniques against the stand-alone WHIRL system [4]. Given a similarity threshold