- Python 3.7
- NumPy
- Gensim 4.1.2
- scikit-learn 0.21.3+
Place this file in the measuring-founding-strategy directory.
Store the company HTML files under:
../out2/company_name/timestamp/*.html
Then run the file to execute the complete pipeline:
python helper.py
The final results are written to the combined_pivot_si folder.
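For orientation, here is a minimal sketch of how the ../out2/company_name/timestamp/*.html layout could be walked to recover the company name, the timestamp, and the page text. It is not the code in helper.py: the names html_to_text and iter_company_pages are hypothetical, and it uses only the standard library, so the real script may well parse HTML differently.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def iter_company_pages(root):
    """Yield (company_name, timestamp, text) for each root/company/timestamp/*.html."""
    for path in sorted(Path(root).glob("*/*/*.html")):
        timestamp = path.parent.name        # directory directly above the file
        company = path.parent.parent.name   # directory above the timestamp
        html = path.read_text(encoding="utf-8", errors="ignore")
        yield company, timestamp, html_to_text(html)
```

Calling list(iter_company_pages("../out2")) would then give one tuple per HTML file, with the timestamp taken verbatim from the directory name.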
The pipeline consists of the following steps:
- Read all HTML files from the target folder and extract timestamps and text
- Train a Doc2Vec model on the extracted information
- Compute similarity scores and write the results to CSV files
- Use mean average precision (mAP) to evaluate how well the model performs on a document retrieval task
- Find the best number of clusters using the silhouette score
- Fit clusters using DBSCAN instead of k-means
- Determine the threshold using the mapping function and user sensitivity
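To make the mAP evaluation step concrete, the sketch below computes mean average precision in plain Python using one common definition. How helper.py defines relevant documents and builds the rankings is not stated here, so the inputs (rankings and relevance, both keyed by a query id) and the function names are assumptions made for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant id appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average_precision over all queries.

    rankings:  dict mapping query id -> list of retrieved ids, best first
    relevance: dict mapping query id -> iterable of relevant ids (assumed input)
    """
    if not rankings:
        return 0.0
    scores = [
        average_precision(ranked, relevance.get(query, ()))
        for query, ranked in rankings.items()
    ]
    return sum(scores) / len(scores)
```

In this setting, each query would be one document, and its ranked list would be the other documents sorted by Doc2Vec similarity, highest first.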