| |
Abstract |
This Ph.D. thesis concerns the problem of indexing techniques for searching in metric spaces. We propose two novel index structures for similarity queries, we have compared them with other approaches and executed numerous experiments to verify their properties.
In order to speedup retrieval in large data collections, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. We have qualitatively analyzed D-Index and verified its properties on actual implementation. We have also compared D-Index with other index structures and demonstrated its superiority on several real-life data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high rate of delete/insert operations.
In the next part, we analyze the problem of similarity join which are often applied in many areas, such as data cleaning or copy detection. Though this approach is not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments. |
|