Wednesday, 25 December 2019

How to test text similarity approaches?

My goal is to benchmark different text similarity approaches for detecting duplicate short sentences. I have applied several methods, including simple edit distances and semantic text similarity techniques. I have prepared a golden dataset of 500 text pairs that I consider duplicates. Each method gives a score between 0 and 1 for each pair. I am unsure which metrics to use to compare the performance of these techniques, and what score threshold to use for classifying a pair as duplicate or not duplicate.
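To make the question concrete, here is a minimal sketch of one possible evaluation setup, not a definitive answer. It assumes the labelled data also contains non-duplicate pairs (label 0), which the golden dataset described above does not yet include. Without negatives, precision cannot be measured. The `similarity` function uses `difflib.SequenceMatcher` purely as a stand-in for whichever scoring method is being tested. The helper names `precision_recall_f1` and `best_threshold` and the toy pairs are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any scoring method."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def precision_recall_f1(scores, labels, threshold):
    """Predict 'duplicate' when score >= threshold, then compare with the labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, steps=101):
    """Sweep evenly spaced thresholds over [0, 1]; return (threshold, f1) with the highest F1.

    Ties are resolved in favour of the lowest threshold.
    """
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(
        ((t, precision_recall_f1(scores, labels, t)[2]) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical toy data: (text_a, text_b, label), where 1 = duplicate, 0 = not duplicate.
pairs = [
    ("How do I reset my password?", "How can I reset my password?", 1),
    ("What is the capital of France?", "What's the capital of France?", 1),
    ("How do I reset my password?", "What is the capital of France?", 0),
    ("Is it raining today?", "Where can I buy a bicycle?", 0),
]

scores = [similarity(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
threshold, f1 = best_threshold(scores, labels)
print(f"best threshold={threshold:.2f}, F1={f1:.2f}")
```

Under these assumptions, each method could be compared by its best F1, and the threshold would ideally be chosen on one split of the labelled pairs and checked on a held-out split so that it is not tuned to the test data.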

Thanks in advance!
