Indie Gems

Get indie song recommendations, because the ones in our music players are not too good!

How does this work?

The idea is to recommend songs that sound similar, so rhythm and tonality should play an important role in the suggestions. With this in mind, each recommendation is based on the similarity between two songs in three components: BPM (beats per minute), Camelot value (which is related to musical key), and lyrics. Users can select custom weights for each component.
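To make the weighting idea concrete, here is a minimal sketch of how three component scores could be combined. It is not taken from the project code: the linear BPM falloff, the 40 BPM cap, the wheel-distance Camelot score, the use of cosine similarity for lyric vectors, the equal default weights, and the field names (`bpm`, `camelot`, `lyrics_vec`) are all assumptions made for illustration.

```python
import math

# Hypothetical defaults: equal weight for every component.
DEFAULT_WEIGHTS = {"bpm": 1.0, "camelot": 1.0, "lyrics": 1.0}


def bpm_similarity(bpm_a, bpm_b, max_diff=40.0):
    """1.0 for identical tempos, falling linearly to 0.0 at max_diff BPM apart."""
    return max(0.0, 1.0 - abs(bpm_a - bpm_b) / max_diff)


def camelot_similarity(key_a, key_b):
    """Score two Camelot codes such as '8A' or '11B' by distance on the wheel."""
    num_a, ring_a = int(key_a[:-1]), key_a[-1].upper()
    num_b, ring_b = int(key_b[:-1]), key_b[-1].upper()
    diff = abs(num_a - num_b)
    steps = min(diff, 12 - diff)  # the wheel wraps around: 12 and 1 are neighbours
    if ring_a != ring_b:
        steps += 1  # assumed cost of moving between the A and B rings
    return 1.0 - steps / 7.0  # 7 is the largest possible number of steps


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def song_similarity(song_a, song_b, weights=None):
    """Weighted average of the three component scores."""
    weights = weights or DEFAULT_WEIGHTS
    scores = {
        "bpm": bpm_similarity(song_a["bpm"], song_b["bpm"]),
        "camelot": camelot_similarity(song_a["camelot"], song_b["camelot"]),
        "lyrics": cosine_similarity(song_a["lyrics_vec"], song_b["lyrics_vec"]),
    }
    total = sum(weights.get(name, 0.0) for name in scores)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(weights.get(name, 0.0) * scores[name] for name in scores) / total
```

Dividing by the sum of the weights keeps the final score on the same scale as the component scores, so a user can pass any non-negative weights without normalising them first.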

About the scraped data

Tools for scraping the web are stored in the scraping directory.

spotify_scraper.py takes the URL of a Spotify playlist as a command-line argument. This script is no longer strictly necessary, because a playlist can easily be downloaded as a CSV from Chosic; a nice change would be to automate that download.
lyrics_scraper.py looks up the lyrics of each song on AZLyrics. It starts the search in Google, which causes problems when the lyrics page is not among the first results. A solution would be to run the search directly within the AZLyrics website.
table_merger.py combines the lyric information with the other data from Chosic or the Spotify scraper.
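As an illustration of the merging step, the sketch below joins a metadata table with a lyrics table using only the standard library. It is not the code in table_merger.py: the column names (`title`, `artist`, `bpm`, `camelot`, `lyrics`), the choice of join key, and the sample rows are hypothetical placeholders.

```python
import csv
import io


def merge_tables(metadata_rows, lyrics_rows, key_fields=("title", "artist")):
    """Attach lyrics to metadata rows that share the same (title, artist) key."""

    def key(row):
        # Compare keys case-insensitively and ignore surrounding whitespace.
        return tuple(row[field].strip().lower() for field in key_fields)

    lyrics_by_key = {key(row): row["lyrics"] for row in lyrics_rows}
    merged = []
    for row in metadata_rows:
        lyrics = lyrics_by_key.get(key(row))
        if lyrics is not None:  # inner join: songs without lyrics are skipped
            merged.append({**row, "lyrics": lyrics})
    return merged


# Placeholder data standing in for a Chosic export and the scraped lyrics.
METADATA_CSV = """title,artist,bpm,camelot
Song A,Artist X,120,8A
Song B,Artist Y,98,11B
"""
LYRICS_CSV = """title,artist,lyrics
song a,Artist X,placeholder lyrics for song a
"""

metadata = list(csv.DictReader(io.StringIO(METADATA_CSV)))
lyrics = list(csv.DictReader(io.StringIO(LYRICS_CSV)))
merged = merge_tables(metadata, lyrics)
```

Songs whose lyrics were not found are dropped here; keeping them with an empty lyrics field would be an equally reasonable choice, depending on how later steps treat missing lyrics.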

About lyric vectorization

Tools used for lyric vectorization are stored in the lyrics directory.

lyrics_preprocessing.py uses NLTK to convert lyrics to lowercase and remove special characters.
lyrics_vectorization.py takes the pre-processed lyrics and converts them to vector form. This vector is formed by extracting the final classification of DistilBERT from HuggingFace and using it to extract features from the lyrics.
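The pre-processing step can be pictured with the short sketch below. Note that it uses the standard `re` module as a stand-in rather than NLTK, so it is not the project's implementation; the function name `preprocess_lyrics` and the decision to keep apostrophes are assumptions.

```python
import re


def preprocess_lyrics(text):
    """Lowercase the lyrics and strip special characters."""
    text = text.lower()
    # Keep ASCII letters, digits, whitespace and apostrophes; replace the rest
    # with a space. Accented letters are dropped too, which may not be desirable.
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `preprocess_lyrics("Hello, World!\nIt's  ME...")` returns `"hello world it's me"`.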

Current plans for improvement

Improve the comparison of song lyrics. Currently, the vanilla version of DistilBERT is used to produce vector embeddings for the lyrics. The computed Pearson correlations lie mostly around 90-95%, and with so little variation between song pairs the resulting similarity scores are unreliable. The first ideas are to fine-tune the model with a dataset specific to text comparison, remove stop words ('the', 'is', 'it'...), or simply use a different model that was trained specifically for comparing texts.
- UPDATE: The DistilBERT model was replaced with "sentence-transformers/all-mpnet-base-v2" from the sentence-transformers library, also from HuggingFace. Similarity is now computed as a vector similarity instead of a Pearson correlation. This gives values between 0 and 1. Still, experiments to fine-tune this or another model, or to create a custom tokenizer, should be considered.
Update the scraping scripts so that there is a straightforward way to grow the dataset.
Refactor the code to retrieve data from the SQLite table, removing the dependency on pkl and csv files.
Handle 'fuzzy' search to account for typos or other factors that may make a text input differ from the values stored in the dataset.
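One possible starting point for the fuzzy search item is `difflib.get_close_matches` from the Python standard library. The sketch below is only an illustration: the function name `fuzzy_find`, the 0.6 cutoff, and the song titles are hypothetical, and a dedicated fuzzy-matching library might turn out to be a better fit.

```python
import difflib


def fuzzy_find(query, titles, max_results=3, cutoff=0.6):
    """Return the stored titles that most closely match a possibly misspelled query."""
    # Match case-insensitively, but return titles in their original spelling.
    by_lowercase = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=max_results, cutoff=cutoff
    )
    return [by_lowercase[match] for match in matches]


# Placeholder titles standing in for values stored in the dataset.
TITLES = ["Midnight Drive", "Paper Boats", "Morning Drift"]
suggestions = fuzzy_find("midnigt drive", TITLES)
```

Results are ordered from best to worst match, so the first element can be used directly or the whole list can be shown to the user as suggestions. Raising `cutoff` makes the search stricter.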
