LC-ATSR: A Transformer-Based Approach for Semantic Retrieval in Lakehouse Data Platforms
Abstract
This study proposes the Lakehouse Collaborative Adaptive Transformer Semantic Representation algorithm (LC-ATSR) to address the core challenge of semantic retrieval over unstructured data in an integrated Lakehouse data platform. The algorithm uses a transformer as its backbone and adds a two-dimensional adaptive mechanism, covering storage attributes and data types, to build an integrated "preprocessing-representation-retrieval" framework. Its core designs are: a storage attribute-enhanced attention mechanism that integrates Lake storage tags into the self-attention computation to adapt to heterogeneous storage characteristics; a lightweight semantic compression and multi-modal alignment module that reduces the vector dimension from 768 to 256 while constructing a unified semantic space; and a Lakehouse incremental semantic indexing mechanism that updates the index dynamically using LSH hashing. Experimental results show that LC-ATSR achieves a P@10 of 89.7% and an F1 score of 88.2%, significantly outperforming mainstream algorithms such as BERT, DPR, and RoBERTa. Single-retrieval latency is 18.3 ms, incremental index construction time is 68.4% lower than with the full-data approach, and accuracy fluctuates by only 3.2% in heterogeneous data scenarios. The retrieval system built on this algorithm achieves a 99.8% functional pass rate, a throughput of 286 QPS with 500 concurrent connections, and a response time of 21.3 ms, meeting the engineering requirements of enterprise data platforms and providing technical support for mining value from unstructured data.
DOI: https://doi.org/10.31449/inf.v50i11.13673
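The incremental LSH indexing idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which LSH family or parameters LC-ATSR uses, so the following assumes random-hyperplane (sign) hashing over 256-dimensional vectors; the class name `IncrementalLSHIndex`, the `n_bits` parameter, and the `cosine` helper are hypothetical and are not taken from the paper. It only shows why adding a vector does not require rebuilding the index: each new vector is hashed into a single bucket.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


class IncrementalLSHIndex:
    """Random-hyperplane LSH index (an assumed scheme, not the paper's)."""

    def __init__(self, dim: int = 256, n_bits: int = 12, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One random hyperplane per signature bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets: Dict[int, List[str]] = defaultdict(list)
        self.vectors: Dict[str, List[float]] = {}

    def _signature(self, vec: List[float]) -> int:
        sig = 0
        for plane in self.planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            sig = (sig << 1) | (1 if dot >= 0.0 else 0)
        return sig

    def add(self, doc_id: str, vec: List[float]) -> None:
        """Incremental update: hash one vector into its bucket, no rebuild."""
        if doc_id in self.vectors:
            # Replacing an existing entry: drop it from its old bucket first.
            self.buckets[self._signature(self.vectors[doc_id])].remove(doc_id)
        self.vectors[doc_id] = vec
        self.buckets[self._signature(vec)].append(doc_id)

    def query(self, vec: List[float], k: int = 10) -> List[str]:
        """Rank the ids in the query's bucket by cosine similarity."""
        candidates = self.buckets.get(self._signature(vec), [])
        scored = sorted(((cosine(vec, self.vectors[c]), c) for c in candidates),
                        reverse=True)
        return [c for _, c in scored[:k]]
```

A single hash table like this trades recall for speed, since near neighbours that fall into a different bucket are missed; practical systems commonly use several tables or multi-probe lookups, which this sketch leaves out.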
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







