Skip to content

add video platform#36

Open
rom1504 wants to merge 1 commit into
mainfrom
video_plat
Open

add video platform#36
rom1504 wants to merge 1 commit into
mainfrom
video_plat

Conversation

@rom1504

@rom1504 rom1504 commented Jan 6, 2023

Copy link
Copy Markdown
Owner

does not actually work well

does not actually work well
@rom1504

rom1504 commented Jan 6, 2023

Copy link
Copy Markdown
Owner Author

ideas to fix:

  • whitelist platforms
  • classifier to exclude non video

@rom1504 rom1504 mentioned this pull request Apr 2, 2023
Comment thread cc2dataset/main.py
from yt_dlp.extractor import gen_extractor_classes, GenericIE


def valid_video_platform_link(link):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can try something like this

import yt_dlp


FILTERED_EXTRACTORS = {ie.IE_NAME:ie for ie in yt_dlp.list_extractor_classes() 
                       if ie not in generic_extractors 
                       and "porn" not in ie.IE_NAME.lower()
                       and "adult" not in ie.IE_NAME.lower()
                       and "xxx" not in ie.IE_NAME.lower()
                       and "xvideos" not in ie.IE_NAME.lower()
                       and "xhamster" not in ie.IE_NAME.lower()
                       and "redtube" not in ie.IE_NAME.lower()
                       and "xtube" not in ie.IE_NAME.lower()
                       and "xstream" not in ie.IE_NAME.lower()
                       and "xfileshare" not in ie.IE_NAME.lower()
                       and "sex" not in ie.IE_NAME.lower()
                       }


# print(FILTERED_EXTRACTORS.keys())
# print(len(FILTERED_EXTRACTORS.keys()))

def is_link_valid(link, extractors):
    """Check if link is valid given a list of extractors."""
    return any([ie.suitable(link) for ie in extractors])

def valid_video_platform_link(link):
    """Check if link is a valid video platform link."""
    return link and is_link_valid(link, FILTERED_EXTRACTORS.values())


YT_URL = "https://www.youtube.com/watch?v=jLX0D8qQUBM"
DM_URL = "https://www.dailymotion.com/video/x29ryo7"
print(valid_video_platform_link(YT_URL))
print(valid_video_platform_link(DM_URL))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generic_extractors = [yt_dlp.extractor.generic.GenericIE,
yt_dlp.extractor.lazy_extractors.GenericIE]

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried at #49

one first problem: running these thousands of regexes is actually quite slow. I guess we need to limit the list or find a way to merge them into one to speed things up (I think that should help?)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also let's try the age_limit property in yt-dlp

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Another option is to do it in 2 steps

  • first checks that the domain is valid among ~2000 selected domains (from yt_dlp extractors _TESTS)
  • then checks if the url is a valid video url (wwith regex from yt_dlp)

This version is more than 100x faster (but is less exhaustive)

@vinyesm vinyesm Nov 16, 2023

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import yt_dlp
from urllib.parse import urlparse

generic_extractors = [yt_dlp.extractor.generic.GenericIE,
yt_dlp.extractor.lazy_extractors.GenericIE]

FILTERED_EXTRACTORS = {ie.IE_NAME:ie for ie in yt_dlp.list_extractor_classes()
if ie not in generic_extractors
and "porn" not in ie.IE_NAME.lower()
and "adult" not in ie.IE_NAME.lower()
and "xxx" not in ie.IE_NAME.lower()
and "xvideos" not in ie.IE_NAME.lower()
and "xhamster" not in ie.IE_NAME.lower()
and "redtube" not in ie.IE_NAME.lower()
and "xtube" not in ie.IE_NAME.lower()
and "xstream" not in ie.IE_NAME.lower()
and "xfileshare" not in ie.IE_NAME.lower()
and "sex" not in ie.IE_NAME.lower()
}

def extract_test(extractor):
tests = []
if hasattr(extractor, "_TEST"):
tests = [extractor._TEST["url"]]
elif hasattr(extractor, "_TESTS"):
tests = [x["url"] for x in extractor._TESTS]
return tests

def normalize_domain(domain):
domain = domain.lower()
if domain.startswith("www."):
domain = domain[4:]
return domain

def extract_domain(url):
parsed_url = urlparse(url)
domain = parsed_url.netloc
return normalize_domain(domain)

DOMAIN_DICT = {}

for extractor in FILTERED_EXTRACTORS.values():
for url in extract_test(extractor):
domain = extract_domain(url)
if domain in DOMAIN_DICT:
DOMAIN_DICT[domain] = DOMAIN_DICT[domain] + [extractor]
else:
DOMAIN_DICT[domain] = [extractor]

def is_link_suitable(link, extractors):
"""Check if link is valid given an extractor."""
return any([ie.suitable(link) for ie in extractors])

def is_link_valid(link, domain_dict):
"""Check if link is valid given a list of extractors."""
is_valid = False
domain = extract_domain(link)
if domain in domain_dict:
is_valid = is_link_suitable(link, domain_dict[domain])
return is_valid

def valid_video_platform_link(link):
"""Check if link is a valid video platform link."""
return link and is_link_valid(link, DOMAIN_DICT)

YT_URL = "https://www.youtube.com/watch?v=jLX0D8qQUBM"
DM_URL = "https://www.dailymotion.com/video/x29ryo7"
DM_URL2 = "https://geo.dailymotion.com/player.html?video=x89eyek&mute=true"
print(valid_video_platform_link(YT_URL))
print(valid_video_platform_link(DM_URL))
print(valid_video_platform_link(DM_URL2))

import time

start_time = time.time()
[valid_video_platform_link(x) for x in [DM_URL] * 1000]
print("--- %s seconds ---" % (time.time() - start_time))

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I put the code there #49

it's faster indeed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Waiting for user input

Development

Successfully merging this pull request may close these issues.

2 participants