WARNING: Use this data at your own risk. Verify that the data is correct and suits your needs before relying on it. Its completeness and correctness are not guaranteed.
This repository contains raw data scraped from two sources, a Discourse forum and the TDS 2025-01 website.
The data includes:
- Discourse Posts: JSON files representing the topic post streams exported from the Discourse JSON API endpoint.
- Website Pages: Markdown files generated from the HTML pages of the TDS 2025-01 website.
- discourse_posts.json: A single JSON file containing all the Discourse posts.
- discourse_json/: A directory containing individual JSON files for each topic post stream.
- tds_pages_md/: A directory containing Markdown files for each page scraped from the TDS 2025-01 website.
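As a quick sanity check of the layout listed above, the following minimal sketch counts the files under each path. The helper names (`count_files`, `summarize_layout`) are hypothetical, and the `.json` and `.md` file extensions are assumptions based on the descriptions above rather than something this README states explicitly.

```python
from pathlib import Path


def count_files(directory, pattern):
    """Count files matching `pattern` in `directory`; 0 if the directory is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.glob(pattern) if p.is_file())


def summarize_layout(root="."):
    """Report which of the documented paths exist under `root` and how many files they hold."""
    root = Path(root)
    return {
        # True if the combined posts file is present
        "discourse_posts.json": (root / "discourse_posts.json").is_file(),
        # Assumed extension: one .json file per topic post stream
        "discourse_json": count_files(root / "discourse_json", "*.json"),
        # Assumed extension: one .md file per scraped page
        "tds_pages_md": count_files(root / "tds_pages_md", "*.md"),
    }


if __name__ == "__main__":
    for name, value in summarize_layout().items():
        print(f"{name}: {value}")
```

Run from the repository root, this prints one line per path. A count of 0 suggests the data has not been downloaded yet or that the assumed extension does not match.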
- Clone the Repository
    git clone https://github.com/23f3004008/TDS-Project1-Data.git
    cd TDS-Project1-Data
- Install Dependencies
  Ensure you have Python installed, then run:
    pip install -r requirements.txt
To download the data, you can use the provided Python scripts:
- Discourse Posts
    python discourse_downloader_full.py
- Website Pages
    python website_downloader_full.py
- Discourse Posts
  The discourse_posts.json file contains all the Discourse posts. You can view this file directly or use a JSON viewer.
- Website Pages
  The Markdown files in the tds_pages_md/ directory can be opened in any text editor or Markdown viewer.
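For inspecting the data programmatically rather than in a viewer, here is a minimal sketch in Python. It makes no assumption about the schema of discourse_posts.json, which this README does not document. It only reports the type and size of the top-level JSON value. The function names `load_json` and `describe` are hypothetical helpers, not part of the provided scripts.

```python
import json
from pathlib import Path


def load_json(path):
    """Parse a JSON file using UTF-8 and return the resulting Python object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def describe(data):
    """Return a short description of the top-level JSON value."""
    if isinstance(data, list):
        return f"list with {len(data)} items"
    if isinstance(data, dict):
        # Show at most the first 10 keys, sorted for stable output
        return f"dict with keys: {sorted(data)[:10]}"
    return type(data).__name__


if __name__ == "__main__":
    path = Path("discourse_posts.json")
    if path.is_file():
        print(describe(load_json(path)))
    else:
        print(f"{path} not found; run the download scripts first")
```

The same `load_json` helper can be pointed at any file in discourse_json/. The Markdown pages are plain text and can be read with `Path(...).read_text(encoding="utf-8")`.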
Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License - see the LICENSE file for details.