java - Efficient algorithm to compare similarity between sets of numbers?


I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove every set that has 5 or more numbers (unordered) in common with any other set.

For example:

  set 1: {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23}
  set 2: {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88}
  set 3: {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23}

Given the 3 sets of 10 numbers above, sets 1 and 3 are considered duplicates because they have 5 numbers in common (998, 22, 7654, 89 and 23). In this case I would delete set 3 (because it is considered equal to set 1).
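The matching rule described above can be written as a plain pairwise check. This is only a minimal sketch of the naive comparison, not an efficient solution: the class name `SetSimilarity` and the methods `countCommon` and `isSimilar` are hypothetical names chosen for illustration, and the code assumes that no number is repeated inside a single set.

```java
import java.util.HashSet;
import java.util.Set;

final class SetSimilarity {
    // Threshold from the question: 5 or more shared numbers means "duplicate".
    static final int THRESHOLD = 5;

    // Counts how many values of b also occur in a; order does not matter.
    // Assumption: neither array contains the same value twice.
    static int countCommon(int[] a, int[] b) {
        Set<Integer> seen = new HashSet<>();
        for (int x : a) {
            seen.add(x);
        }
        int common = 0;
        for (int y : b) {
            if (seen.contains(y)) {
                common++;
            }
        }
        return common;
    }

    static boolean isSimilar(int[] a, int[] b) {
        return countCommon(a, b) >= THRESHOLD;
    }

    public static void main(String[] args) {
        int[] set1 = {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23};
        int[] set2 = {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88};
        int[] set3 = {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23};
        System.out.println(countCommon(set1, set3)); // prints 5
        System.out.println(isSimilar(set1, set2));   // prints false
    }
}
```

Comparing every pair this way costs on the order of n² comparisons for n sets, which is the cost the question is trying to avoid.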

I have more than 10,000 sets to compare and I want to do it very efficiently. I've been turning this over and I can't think of an efficient way to do it (it would be good to do this in a single pass).

Any ideas? Thanks!

Mike

You should reconsider your requirements because, as stated, the operation does not even have a well-defined result. For example, take these sets:

  set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
  set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
  set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

If you first consider 1 and 2 to be "duplicates" and eliminate set 1, then 2 and 3 are also "duplicates" and you are left with only one remaining set. But if you eliminate set 2 first instead, then 1 and 3 have no matches and you are left with two sets remaining.

You can easily extend this to your full 10,000 sets: depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.
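To make the order dependence concrete, here is a small sketch that runs one possible elimination rule over the three example sets in two different orders. The greedy rule itself (keep a set only if it shares fewer than 5 numbers with every set already kept) is an assumption made for illustration, as are the names `OrderDependenceDemo`, `keepGreedy` and `range`; the question does not say which set of a pair should be deleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OrderDependenceDemo {
    // Size of the intersection of a and b; neither input is modified.
    static int countCommon(Set<Integer> a, Set<Integer> b) {
        Set<Integer> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }

    // Assumed greedy rule: walk the sets in the given order and keep a set
    // only if it shares fewer than 5 numbers with every set kept so far.
    static List<Set<Integer>> keepGreedy(List<Set<Integer>> ordered) {
        List<Set<Integer>> kept = new ArrayList<>();
        for (Set<Integer> candidate : ordered) {
            boolean duplicate = false;
            for (Set<Integer> k : kept) {
                if (countCommon(candidate, k) >= 5) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Helper: the integers from..to, both inclusive.
    static Set<Integer> range(int from, int to) {
        Set<Integer> s = new HashSet<>();
        for (int i = from; i <= to; i++) {
            s.add(i);
        }
        return s;
    }

    public static void main(String[] args) {
        Set<Integer> s1 = range(1, 10);
        Set<Integer> s2 = range(6, 15);
        Set<Integer> s3 = range(11, 20);
        // Set 1 first: set 2 is dropped, set 3 survives.
        System.out.println(keepGreedy(Arrays.asList(s1, s2, s3)).size()); // prints 2
        // Set 2 first: both set 1 and set 3 are dropped.
        System.out.println(keepGreedy(Arrays.asList(s2, s1, s3)).size()); // prints 1
    }
}
```

The same input produces two different answers, so any implementation has to fix the processing order (or change the rule) before its result means anything.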

Your problem is that you are trying to find equivalence classes, but the "similarity" relation you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms: if set A is "similar" to set B and set B is "similar" to set C, your definition does not ensure that A is also "similar" to C, and so you cannot meaningfully eliminate similar sets.

Before worrying about an efficient implementation, you need to clarify your requirements to deal with this problem. Either find a way to define a transitive similarity, or keep all the sets and work only with comparisons (or with a list of similar sets for each single set).
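The second option, keeping every set and recording which other sets it resembles, could look roughly like the sketch below. It uses an inverted index (number to the sets containing it) so that only sets sharing at least one number are ever compared; that technique is a suggestion added here, not something the answer prescribes. The names `SimilarityLists` and `similarTo` are hypothetical, and the code assumes no number is repeated inside a single set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimilarityLists {
    // For each set index i, returns the sorted indices of all other sets
    // that share at least `threshold` numbers with set i.
    // Assumption: no value occurs twice within one set.
    static List<List<Integer>> similarTo(int[][] sets, int threshold) {
        // Inverted index: number -> indices of the sets that contain it.
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < sets.length; i++) {
            for (int value : sets[i]) {
                index.computeIfAbsent(value, k -> new ArrayList<>()).add(i);
            }
        }

        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < sets.length; i++) {
            // shared.get(j) = how many numbers set i has in common with set j.
            Map<Integer, Integer> shared = new HashMap<>();
            for (int value : sets[i]) {
                for (int j : index.get(value)) {
                    if (j != i) {
                        shared.merge(j, 1, Integer::sum);
                    }
                }
            }
            List<Integer> similar = new ArrayList<>();
            for (Map.Entry<Integer, Integer> e : shared.entrySet()) {
                if (e.getValue() >= threshold) {
                    similar.add(e.getKey());
                }
            }
            Collections.sort(similar);
            result.add(similar);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] sets = {
            {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
            {6, 7, 8, 9, 10, 11, 12, 13, 14, 15},
            {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
        };
        System.out.println(similarTo(sets, 5)); // prints [[1], [0, 2], [1]]
    }
}
```

The result keeps the non-transitive structure visible (set 1 resembles sets 0 and 2, which do not resemble each other) instead of forcing a deletion order. How fast this is in practice depends on how many sets share each number.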

