Adding fake subdomains to a low-resource domain. #29

@Martovark

I sometimes run into domains for which very few subdomain examples exist. For example, only www.some_domain.com or subdomain.com is known, and no other data is available.

Two simple ideas arise:

  • Before tokenization, add fake subdomains that give the model additional context and may improve search results, for example [git, gitlab, ...] or [ftp, mail, ...]. Which subdomains to choose is a hard question; the distribution of the training data may be worth considering.
  • Add an option to automatically restart subdomain scanning with the newly found subdomains as input. You can already copy the new list of domains into example_input.txt and restart manually, but that is not very convenient. The search could instead continue in a loop until no new domains appear.
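
To make the two ideas concrete, here is a minimal sketch of how they could fit together. It is not the project's API: `COMMON_LABELS`, `add_fake_subdomains`, `scan_until_stable`, and the `predict` callable are all hypothetical names, and `predict` stands in for whatever step turns a set of known hosts into candidate subdomains.

```python
from typing import Callable, Iterable, Set

# Hypothetical default labels; in practice they might be chosen from
# the most frequent labels in the training data.
COMMON_LABELS = ("www", "mail", "ftp", "git", "gitlab")


def add_fake_subdomains(
    known: Iterable[str],
    apex: str,
    labels: Iterable[str] = COMMON_LABELS,
) -> Set[str]:
    """Return the known hosts plus '<label>.<apex>' for every label."""
    seeds = set(known)
    seeds.update(f"{label}.{apex}" for label in labels)
    return seeds


def scan_until_stable(
    seeds: Iterable[str],
    predict: Callable[[Set[str]], Iterable[str]],
    max_rounds: int = 10,
) -> Set[str]:
    """Feed results back into `predict` until a round finds nothing new.

    `predict` is an assumed stand-in for one scan run. `max_rounds`
    bounds the loop in case the predictor keeps producing new names.
    """
    found = set(seeds)
    for _ in range(max_rounds):
        new_hosts = set(predict(found)) - found
        if not new_hosts:
            break  # fixed point reached: no new domains appeared
        found |= new_hosts
    return found
```

Because the fake seeds are not verified hosts, a real implementation would probably want to track them separately and drop any that do not resolve before reporting results.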
