Internyl Website Repo: https://github.com/internyl-dev/internyl-frontend
Internyl Data Acquisition Repo: https://github.com/internyl-dev/internyl-ai-wrapper/tree/refactor
Internyl Website: https://internyl.org
- Navigate to .env and input any API keys you may have there
# 1. Create a virtual environment
python -m venv .venv
# 2. Activate the virtual environment
Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process
.venv/scripts/activate
# 3. Install requirements
pip install -r requirements.txt
playwright install
# 4. Run run.py
python run.py
The discovery agent utilizes a LangChain agent execution chain to perform the agentic behavior. All it does is take a specific set of instructions written to look for programs not already put on a given section of the database (in our case, programs-display) and look specifically for a webpage containing just one internship and its overview information.
src/
└── models/
└── response_model.py
Output - programs: List[HttpUrl]
This stores the programs the agent finds as a list of URLs.
src/
└── prompts/
├── assets/
│ └── prompt.py
└── prompt_create.py
The model query aims at prioritizing the following:
- Returning pages with one unique program
- Each page should contain exactly one program
- If a page with a singular program cannot be found, return the overview page of the multiple programs
- Avoid list websites (eg. "Top 10 Internships")
- Each page should contain a unique program
- Check the given programs to make sure the program does not already exist
- Each page should contain exactly one program
- Returning the overview page
- Look for the page with the overview information about the program
- Efficient search methods
- Subject specific
- Institution specific
- Program specific
- Deduplication
src/
└── tools/
├── content_scraper.py
├── ddgs_run.py
└── links_scraper.py
Search - search (search_term:str, max_results:int)
Using the DDGS search module, return the search results given a search term.
Visit URL - visit_url (url:str)
Using Playwright return the parsed webpage contents of a URL.
Get All Links - get_all_links (url:str)
Using Playwright return all URLs found within a webpage.