Text Similarity Matching Based on Siamese Network and Char-Word Vector Combination
Abstract: Text similarity matching underlies many natural language processing tasks. This paper proposes a text similarity matching method that combines a Siamese network with character- and word-level vectors, using the Siamese architecture to model each text as a whole and judge whether two texts are similar. First, when extracting text feature vectors, the BERT and WoBERT models produce character-level and word-level sentence vectors respectively, and the two are combined so that the sentence vector carries richer semantic information. Second, to address the excessive dimensionality introduced when the feature information is fused, PCA is applied to reduce the high-dimensional vectors, removing redundant information and noise. Finally, a Softmax classifier outputs the similarity matching result. Experiments on the LCQMC dataset show that the proposed model reaches an accuracy of 89.92% and an F1 score of 88.52%, extracting text semantics more effectively and suiting text similarity matching tasks better.
Keywords: text similarity matching; char-word vector combination; Siamese network; PCA; BERT
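The pipeline the abstract describes (char- and word-level sentence vectors fused, PCA-reduced, then classified by Softmax) can be sketched as below. This is a minimal shape-level illustration, not the paper's implementation: random arrays stand in for the BERT/WoBERT encoder outputs, the 768-d size and 200 sentence pairs are illustrative assumptions, the classifier weights are untrained, and PCA is implemented directly with an SVD rather than the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the char-level (BERT) and word-level (WoBERT) sentence
# vectors; the paper obtains these from pretrained encoders. The 768-d
# size and 200 sentence pairs are illustrative assumptions.
n_pairs, dim = 200, 768
char_vecs = rng.normal(size=(n_pairs, dim))
word_vecs = rng.normal(size=(n_pairs, dim))

# Fuse character- and word-level vectors; concatenation is one simple way
# to combine the two granularities into a richer representation.
combined = np.concatenate([char_vecs, word_vecs], axis=1)  # (200, 1536)

def pca_reduce(x, k):
    """Project x onto its top-k principal components (plain-NumPy PCA)."""
    x_centered = x - x.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:k].T

# PCA shrinks the oversized fused vector, discarding redundancy and noise.
reduced = pca_reduce(combined, 128)  # (200, 128)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical (untrained) classifier weights, for shape illustration only.
W = rng.normal(size=(128, 2))
probs = softmax(reduced @ W)  # per-pair [not-similar, similar] probabilities
print(probs.shape)
```

In the actual model the final Softmax layer is trained end-to-end on LCQMC pairs; here the weights are random, so only the data flow and shapes are meaningful.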