Ayad I. Abdulsada, Salah Al-Darraji, Dhafer G. Honi

Privacy preserving scheme for document similarity detection

The problem of detecting similar documents plays an essential role for many real-world applications, such as copyright protection and plagiarism detection. To protect data privacy, the new version of such a problem becomes more challenging, where the matched documents are distributed among two or more parties and their privacy should be preserved. In this paper, we propose new privacy-preserving document similarity detection schemes by utilizing the locality-sensitive hashing technique, which can handle the misspelled mistakes. Furthermore, the keywords’ occurrences of a given document are integrated into its underlying representation to support a better ranking for the returned results. We introduced a new security definition, which hides the exact similarity scores towards the querying party. Extensive experiments on real-world data illustrate that our proposed schemes are efficient and accurate.

PDF

___

[1] Marjai P, Lehotay-Kéry P, Kiss A. Document similarity for error prediction. Journal of Information and Telecommunication 2021; 5 (9):1-4. doi: 10.1080/24751839.2021.1893496
[2] Fröbe M, Bevendorff J, Gienapp L, Völske M, Stein B et al. CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl. In: Proceedings of the 44th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2021); New York, NY, USA; 2021. pp. 2398–2404.
[3] Ma R, Li Y, Li C, Wan F, Hu H et al. Secure multiparty computation for privacy-preserving drug discovery. Bioinformatics 2020; 36 (9):2872-2880. doi: 10.1093/bioinformatics/btaa038
[4] Abid A, Ali W, Farooq MS, Farooq U, Khan NS et al. Semi-automatic classification and duplicate detection from human loss news corpus. IEEE Access 2020; 8: 97737-97747. doi: 10.1109/ACCESS.2020.2995789
[5] Jiang W, Murugesan M, Clifton C, Si L. Similar document detection with limited information disclosure. In: IEEE 24th International Conference on Data Engineering; Cancun, Mexico; 2008. pp. 735-743.
[6] Murugesan M, Jiang W, Clifton C, Si L, Vaidya J. Efficient privacy-preserving similar document detection. The International Journal on Very Large Databases 2010; 19 (4):457-475. doi: 10.1007/s00778-009-0175-9
[7] Jiang W, Samanthula B. N-gram based secure similar document detection. In: IFIP Annual Conference on Data and Applications Security and Privacy; Berlin, Heidelberg; 2011. pp. 239-246.
[8] Buyrukbilen S, Bakiras S. Secure similar document detection with simhash. In: Workshop on Secure Data Management; Trento, Italy; 2013. pp. 61-75.
[9] Yu X, Chen X, Shi J, Shen L, Wang D. Efficient and scalable privacy-preserving similar document detection. In: IEEE Global Communications Conference; Singapore; 2017. pp. 1-7.
[10] Samanthula B, Jiang W. Secure multiset intersection cardinality and its application to jaccard coefficient. IEEE Transactions on Dependable and Secure Computing 2016; 13 (5):591-604. doi: 10.1109/TDSC.2015.2415482
[11] Forman S, Samanthula B. Secure similar document detection: Optimized computation using the jaccard coefficient. In: IEEE 4th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing,(HPSC) and IEEE International Conference on Intelligent Data and Security (IDS); Omaha, NE, USA; 2018. pp. 1-4.
[12] Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing; Dallas Texas, USA; 1998. pp. 604-613.
[13] Charikar M. Similarity estimation techniques from rounding algorithms. In: Proceedings of the thiry-fourth annual ACM symposium on Theory of computing; Montreal Quebec, Canada; 2002. pp. 380-388.
[14] Dong C, Chen L, Wen Z. When private set intersection meets big data: an efficient and scalable protocol. In: Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security; New York, NY, USA; 2013. pp. 789-800.
[15] Canetti R, Sarkar P, Wang X. Efficient and round-optimal oblivious transfer and commitment with adaptive security. In: International Conference on the Theory and Application of Cryptology and Information Security; Daejeon, South Korea; 2020. pp. 277-308.
[16] De Cristofaro E, Gasti P, Tsudik G. Fast and private computation of cardinality of set intersection and union. In: International Conference on Cryptology and Network Security; Berlin, Heidelberg; 2012. pp. 218-231.
[17] Bloom B. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 1970; 13 (7): 422-426. doi: 10.1145/362686.362692
[18] Datar M, Immorlica N, Indyk P, Mirrokni V. Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on Computational geometry; New York, NY, USA; 2004. pp. 253-262.
[19] Gentry C. Fully homomorphic encryption using ideal lattices. In: Proceedings of the forty-first annual ACM symposium on Theory of computing; New York, NY, USA; 2009. pp. 169-178.
[20] Chillotti I, Gama N, Georgieva M, Izabachene M. Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds. In: International conference on the theory and application of cryptology and information security; Berlin, Heidelberg; 2016. pp. 3-33.
[21] Schoppmann P, Vogelsang L, Gascón A, Balle B. Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue. Proceedings on Privacy Enhancing Technologies 2020; 2020 (2):209-229. doi: 10.2478/popets-2020-0024
[22] Brakerski Z, Gentry C, Vaikuntanathan V. (leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory 2014; 6 (3):1-36. doi: 10.1145/2090236.2090262
[23] Martins P, Sousa L, Mariano A. A survey on fully homomorphic encryption: An engineering perspective. ACM Computing Surveys 2017; 50 (6):1-33. doi: 10.1145/3124441
[24] Paillier P. Public-key cryptosystems based on composite degree residuosity classes. In: International conference on the theory and applications of cryptographic techniques; Berlin, Heidelberg; 1999. pp. 223-238.
[25] Yao A.C.C. How to generate and exchange secrets. In: 27th Annual Symposium on Foundations of Computer Science; Toronto, ON, Canada; 1986. pp. 162-167.
[26] Goldreich O, Micali S, Wigderson A. How to play any mental game, or a completeness theorem for protocols with honest majority. In: 19th Annual ACM Symposium on Theory of Computing; New York, NA, USA; 1987. pp. 218-229.
[27] Aly A, Orsini E, Rotaru D, Smart NP, Wood T. Zaphod: Efficiently combining LSSS and garbled circuits in SCALE. In: Proceedings of the 7th ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography; London, UK; 2019. pp. 33-44.
[28] Fan X, Ganesh C, Kolesnikov V. Hashing garbled circuits for free. In: Annual International Conference on the Theory and Applications of Cryptographic Techniques; Paris, France; 2017. pp. 456-485.
[29] Li P, Li T, Yao Z, Tang C, Li J. Privacy-preserving outsourcing of image feature extraction in cloud computing. Soft Computing 2017; 21 (15):4349-4359. doi: 10.1007/s00500-016-2066-5
[30] Veugen T. Improving the DGK comparison protocol. In: 2012 IEEE International Workshop on Information Forensics and Security (WIFS); Costa Adeje, Spain; 2012. pp. 49-54.
[31] Damgård I, Geisler M, Krøigaard M. Efficient and secure comparison for on-line auctions. In: Australasian conference on information security and privacy; Berlin, Heidelberg; 2007. pp. 416-430.
[32] Manning C, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge, England: Cambridge University Press, 2008.
[33] Bost R, Popa R, Tu S, Goldwasser S. Machine learning classification over encrypted data. In: The Network and Distributed System Security Symposium; San Diego, CA, USA; 2015. pp.1-14.
[34] Lindell Y, Pinkas B. Secure multiparty computation for privacy-preserving data mining. Journal of Privacy and Confidentiality 2009; 1 (1):59–98. doi: 10.29012/jpc.v1i1.566
[35] Porter M. An algorithm for suffix stripping. Program: electronic library and information systems 2006; 40 (3):211- 218. doi: 10.1108/00330330610681286
[36] Urbano Merino J, De Lima HA, Hanjalic A. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19); Paris, France; 2019. pp. 505–514.