This repository was archived by the owner on Aug 5, 2023. It is now read-only.

Improving similarity measure #3

@ctlaltdefeat

Hi,

I did something similar a while back at https://channel-similarity.johnpyp.com/, but it didn't generate much interest, presumably because (among other things) it lacked visualization. Visualization is a core part of this project; it's well executed and makes the result cool and interesting to look at.

However, I do think the similarity measure I used does a better job of capturing similarity between channel communities, and it could be implemented here without too much trouble. The math is outlined at https://channel-similarity.johnpyp.com/details, but it essentially boils down to two differences from what you currently have:

  1. The weight of a viewer should be normalized by the number of channels they're in. The reasoning is that a viewer's relative weight in a channel should reflect what percentage of their viewing is dedicated to that channel.
    For example: channel A and channel B sharing a viewer who ONLY views these two channels on the entire site should count for more than channel A and channel B both happening to have Nightbot, which is present in a large chunk of the channels on the site. Currently both contribute the same weight to similarity.
    (This should be fairly simple to implement, doesn't require any scraping changes, and can also be used for the realtime channel page view.)

  2. The weight of a viewer should be determined not only by whether they happened to be in a channel during the collection period, but also by the amount of time they spent there (i.e. the number of scrapes they appeared in).
    An example of the shortcoming of the current approach: if channel A happens to host channel B during the collection period, then all the chatters who appear only momentarily in channel B's chat currently contribute as much weight to similarity as chatters who spend long periods of time in both channels.
    (This could require scraping changes to store these values.)
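
To make the two points concrete, here is a minimal Python sketch. It is not the exact formula from the details page and not this project's code: the input shape (`scrape_counts`, a hypothetical mapping of viewer to per-channel scrape counts), the function names, and the choice to sum the smaller of the two weights over shared viewers are all assumptions made for illustration.

```python
def viewer_weights(scrape_counts):
    """Turn raw scrape counts into per-channel viewer weights.

    scrape_counts: {viewer: {channel: number_of_scrapes_seen}} (assumed shape).
    Returns {channel: {viewer: weight}}, where each viewer's weights across
    all channels sum to 1, i.e. the share of their viewing spent there.
    """
    weights = {}
    for viewer, per_channel in scrape_counts.items():
        total = sum(per_channel.values())
        if total == 0:
            continue  # viewer never actually appeared in a scrape
        for channel, count in per_channel.items():
            weights.setdefault(channel, {})[viewer] = count / total
    return weights


def similarity(weights, a, b):
    """Sum, over shared viewers, the smaller of their two channel weights.

    Taking the minimum is an illustrative choice, not the source's formula.
    """
    wa = weights.get(a, {})
    wb = weights.get(b, {})
    shared = wa.keys() & wb.keys()
    return sum(min(wa[v], wb[v]) for v in shared)


# Hypothetical data: three viewers shared by channels A and B.
scrape_counts = {
    "loyal_viewer": {"A": 10, "B": 10},                  # only watches A and B
    "nightbot": {"A": 10, "B": 10, "C": 10, "D": 10},    # present everywhere
    "host_drive_by": {"A": 19, "B": 1},                  # briefly in B via a host
}
weights = viewer_weights(scrape_counts)
score = similarity(weights, "A", "B")
```

With a plain shared-viewer count, all three viewers above would contribute equally. Under this sketch the loyal viewer contributes the most, the bot less (and less still the more channels it sits in), and the drive-by chatter the least, which matches the intent of both points.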
