Updated CourseInfo uses multithreading instead of multiprocessing #147
base: main
Conversation
from .abstract import EnhanceBase
from db.Models import DepartmentDistribution, ClassDistribution, Libed, Session, and_
import requests
import httpx
Not imported in pipenv
Also, we already use aiohttp for RMP. I would encourage using that so we keep our library surface area low. Check the RMP folder for what that looks like.
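A rough sketch of what the fetch could look like with aiohttp; the endpoint URL and the fetch_courses name are placeholders, and the real pattern should mirror whatever the RMP code already does:

import aiohttp

async def fetch_courses(dept: str, campus: str) -> list[dict]:
    # Placeholder endpoint and params -- copy the session handling from the RMP enhancer.
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://example.edu/api/courses",
            params={"dept": dept, "campus": campus},
        ) as response:
            response.raise_for_status()
            data = await response.json()
            return data.get("courses", [])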
finally:
    session.close()

async def enhance_helper(self, dept_dist: DepartmentDistribution, semaphore: asyncio.Semaphore) -> None:
does not fit abstract class structure.
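I'm not sure exactly what EnhanceBase pins down, but one way to keep the public surface unchanged is to create the semaphore inside enhance() and keep the helper private (sketch, method names assumed):

async def enhance(self, dept_dists: list[DepartmentDistribution]) -> None:
    # The semaphore stays an implementation detail rather than a public parameter.
    semaphore = asyncio.Semaphore(4)
    await asyncio.gather(
        *(self._enhance_helper(dept_dist, semaphore) for dept_dist in dept_dists)
    )

async def _enhance_helper(self, dept_dist: DepartmentDistribution, semaphore: asyncio.Semaphore) -> None:
    async with semaphore:
        ...  # fetch and process one department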
with Pool() as pool:
    pool.map(self.enhance_helper, dept_dists)

semaphore = asyncio.Semaphore(9)  # Limit concurrent tasks to something under 5 due to rate limiting
Comment conflicts with implementation
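Either lower the limit or fix the comment, e.g. (assuming the rate limit really does require fewer than 5 concurrent requests):

semaphore = asyncio.Semaphore(4)  # Limit concurrent tasks due to rate limiting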
campus_str = str(campus)
def _process_course_data(self, courses: list[dict], dept: str, campus: str) -> None:
    session = Session()
    try:
If even one course fails, the entire batch will be considered faulty.
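A sketch of per-course error handling so a single bad record gets skipped instead of sinking the whole department; _upsert_course is a hypothetical name for the per-course work done in _process_course_data:

for course in courses:
    try:
        self._upsert_course(session, course, dept, campus)
    except (KeyError, ValueError) as e:
        # Skip just this course instead of treating the whole batch as faulty.
        print(f"Skipping course in {dept}: {e}")
session.commit()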
    response.raise_for_status()
    req = response.json()
    courses = req.get("courses", [])
except httpx.HTTPStatusError as e:
I'd like to be able to tell from our logs whether this is a client error (4xx, e.g. forbidden) or a server error (5xx).
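For example (sketch; if this moves to aiohttp, the analogous exception is aiohttp.ClientResponseError with a .status attribute):

except httpx.HTTPStatusError as e:
    status = e.response.status_code
    kind = "client error (4xx)" if status < 500 else "server error (5xx)"
    print(f"Course fetch for {dept} failed with a {kind}: {status}")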
campus = dept_dist.campus
campus_str = str(campus)
def _process_course_data(self, courses: list[dict], dept: str, campus: str) -> None:
    session = Session()
Sessions should be created for every granular change. You're batching a massive amount of data into one commit.
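A sketch of more granular commits, one short-lived session per course instead of a single big batch (again with a hypothetical _upsert_course helper):

for course in courses:
    session = Session()
    try:
        self._upsert_course(session, course, dept, campus)
        session.commit()
    except Exception:
        session.rollback()
    finally:
        session.close()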
# Only process UMNTC and UMNRO campuses
if campus_str not in ["UMNTC", "UMNRO"]:
    return
courses = []
Do you think this is necessary? courses will be assigned a value if the try succeeds, and if it fails the function should just return as a no-op.
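That is, something like this sketch, returning from the except branch so the pre-initialization isn't needed:

try:
    response.raise_for_status()
    req = response.json()
    courses = req.get("courses", [])
except ValueError:
    print("Json malformed, icky!")
    return  # nothing to process, so just bail out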
    courses = req.get("courses", [])
except ValueError:
    print("Json malformed, icky!")
await asyncio.to_thread(self._process_course_data, courses, dept, campus)
This spawns a separate thread to run a synchronous function. Can you make _process_course_data async and just await it instead? That would reduce thread overhead.
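Roughly this shape (sketch; it assumes the body of _process_course_data won't block the event loop with long synchronous DB calls):

async def _process_course_data(self, courses: list[dict], dept: str, campus: str) -> None:
    ...  # same logic as today, just defined as a coroutine

# inside enhance_helper, instead of asyncio.to_thread(...):
await self._process_course_data(courses, dept, campus)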
def enhance(self, dept_dists: list[DepartmentDistribution]) -> None:
async def enhance(self, dept_dists: list[DepartmentDistribution]) -> None:
    """Enhance the data for a list of department distributions in a multiprocessing pool."""
Update comment.
In addition, you want to change the signature and logic of CourseDogEnhance. Even though we no longer use it, I would like to have backwards compatibility.
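One way to keep backwards compatibility is a thin synchronous wrapper around the new coroutine (sketch; enhance_async is a name I'm inventing here):

def enhance(self, dept_dists: list[DepartmentDistribution]) -> None:
    # Old synchronous entry point, preserved for existing callers.
    asyncio.run(self.enhance_async(dept_dists))

async def enhance_async(self, dept_dists: list[DepartmentDistribution]) -> None:
    ...  # the new asyncio-based implementation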
Once you merge in the changes from #148, can you also add the following to the pipenv so we force Python 3.14t usage?
With the free-threaded 3.14t build, the threads this asyncio code spawns (e.g. via asyncio.to_thread) can actually run in parallel instead of being serialized by the GIL.