Monday, March 28, 2011

Get Common Rows Within The Same Table

I've had a bit of a search, but didn't find anything quite like what I'm trying to achieve.

Basically, I'm trying to find a similarity between two users' voting habits.

I have a table storing each individual vote made, which stores:

voteID
itemID     (the item the vote is attached to)
userID     (the user who voted)
direction  (whether the user voted the post up, or down)

I'm aiming to calculate the similarity between, say, users A and B, by finding out two things:

  1. The number of votes they have in common. That is, the number of times they've both voted on the same post (the direction does not matter at this point).
  2. The number of times they've voted in the same direction, on common votes.

(Then simply to calculate #2 as a percentage of #1, to achieve a crude similarity rating).

My question is, how do I find the intersection between the two users' sets of votes? (i.e. how do I calculate point #1 adequately, without looping over every vote in a highly inefficient way.) If they were in different tables, an INNER JOIN would suffice, I'd imagine... but that obviously won't work on the same table (or will it?).

Any ideas would be greatly appreciated.

From stackoverflow
  • You most certainly can join a table to itself. In fact, that's what you're going to have to do. You must use aliasing when joining a table to itself. If your table doesn't have a PK or FK, you'll have to use Union instead. Union will remove duplicates and Union All will not.

    James Burgess : But how do you go about joining it on the same column with the differing user IDs? Any chance you could give an example - as I'm obviously overlooking something, here...
    GregD : I was going to, but ninesided beat me too it.
    ninesided : my bad, your post appeared halfway through me typing mine!
    GregD : no problems. Your post was good! :)
  • Something like this:

    SELECT COUNT(*)
    FROM votes v1
    INNER JOIN votes v2 ON (v1.item_id = v2.item_id)
    WHERE v1.userID = 'userA'
    AND v2.userUD = 'userB'
    
    James Burgess : Just what I needed, thanks.
  • Here's an example that should get you closer:

    SELECT COUNT(*)
    FROM (
          SELECT u1.userID
          FROM vote u1, vote u2
          WHERE u1.itemID = u2.itemID
          AND u1.userID = user1
          AND u2.userID = user2)
    
    ninesided : old school join syntax
    Lance Roberts : That could be, you're welcome to point me out to some article on joins that explains why one way is better than the other. I don't have tons of SQL experience.
    ninesided : It's all about readability and ease of comprehension. Check out this answer: http://bit.ly/ansijoin
    Lance Roberts : Thanks, that was a good link. I'll upvote your answer.
  • Assuming userID 1 being compared to userID 2

    For finding how many votes they have in common:

    SELECT COUNT(*)
    FROM Votes AS v1
    INNER JOIN Votes AS v2 ON (v2.userID = 2
                                AND v2.itemID = v1.itemID)
    WHERE v1.userID = 1;
    

    For finding when they also voted the same:

    SELECT COUNT(*)
    FROM Votes AS v1
    INNER JOIN Votes AS v2 ON (v2.userID = 2
                                AND v2.itemID = v1.itemID
                                AND v2.direction = v1.direction)
    WHERE v1.userID = 1;
    
  • A self join is in order. Here it is with all you asked:

    SELECT v1.userID user1, v2.userID user2,
      count(*) n_votes_in_common,
      sum(case when v1.direction = v2.direction then 1 else 0 end) n_votes_same_direction,
      (n_votes_same_direction * 100.0 / n_votes_in_common) crude_similarity_percent 
    FROM votes v1
    INNER JOIN votes v2
    ON v1.item_id = v2.item_id
    
  • In case you want to do this for a single user (rather than knowing both users at the start) to find to whom they are the closest match:

    SELECT
         v2.userID,
         COUNT(*) AS matching_items,
         SUM(CASE WHEN v2.direction = v1.direction THEN 1 ELSE 0 END) AS matching_votes
    FROM
         Votes v1
    INNER JOIN Votes v2 ON
         v2.userID <> v1.userID AND
         v2.itemID = v1.itemID
    WHERE
         v1.userID = @userID
    GROUP BY
         v2.userID
    

    You can then limit that however you see fit (return the top 10, top 20, all, etc.)

    I haven't tested this yet, so let me know if it doesn't act as expected.

    James Burgess : That's actually really helpful, thanks :)

0 comments:

Post a Comment