Conversation
…s, create subjects
…reate_price, create_location, begin refactoring
rhigman
left a comment
There was a problem hiding this comment.
Looks good! I've made a few comments about edge cases etc as they've occurred to me, but they may already be things you've thought of and chosen not to overcomplicate the logic with.
csvloader.py
Outdated
| # Convert NaN to None for all fields | ||
| contributor = self.convert_nan_to_none(contributor) | ||
|
|
||
| if full_name not in self.all_contributors: |
There was a problem hiding this comment.
This could backfire for common names like "John Smith".
There was a problem hiding this comment.
I totally agree, and it's also following a common pattern in the repo - 10 other loaders use the same logic. Without an ORCID, which we don't require and is frequently not present, is there another way that we can fix this?
There was a problem hiding this comment.
I added some logic for checking by ORCID before full_name, which at least improves the existing logic, even if it doesn't fix the common names issue
| else: | ||
| contributor_id = self.all_contributors[full_name] | ||
| logging.info(f"contributor {full_name} already in Thoth, skipping creation") | ||
| existing_contribution = next( |
There was a problem hiding this comment.
Note it's possible for the same contributor to be listed with multiple contributions of different types
|
|
||
| existing_affiliations = next( | ||
| (c.affiliations for c in work.contributions if c.contributionId == contribution_id), []) | ||
| if any(a.institution.institutionId == institution_id for a in existing_affiliations): |
There was a problem hiding this comment.
More than one affiliation at the same institution is technically possible, I think
| if publication_type in ["paperback", "hardback"]: | ||
| self.create_price(row, publication_type, publication_id) | ||
| elif publication_type in ["pdf", "epub"]: | ||
| self.create_location(row, publication_type, publication_id) |
There was a problem hiding this comment.
I'm intrigued that the template presumably doesn't contain locations for paperback/hardback, when they can have landing pages if not full text URLs (and, indeed, epubs can have prices)
| if pd.notna(row[f"publication_{publication_type}_price_{index}_currency_code"]): | ||
| currency = row[f"publication_{publication_type}_price_{index}_currency_code"] | ||
|
|
||
| if unit_price and currency: |
There was a problem hiding this comment.
Is there any possibility of hitting errors here if a publisher accidentally includes more than one price with the same currency?
| if pd.isna(series_name): | ||
| logging.info(f"{work.fullTitle} missing series name; skipping create_series") | ||
| return | ||
| if series_name not in self.all_series: |
csvloader.py
Outdated
|
|
||
| # if series_issue_number is present in CSV, use it | ||
| if pd.notna(row["series_issue_number"]): | ||
| issue_ordinal = row["series_issue_number"] |
There was a problem hiding this comment.
Worth checking that the series_issue_number doesn't happen to match an existing issue_ordinal?
| else: | ||
| logging.info(f"Series Issue already exists for work; skipping creating issue of {current_series}") | ||
|
|
||
| def convert_nan_to_none(self, data_dict): |
There was a problem hiding this comment.
Nice. I haven't checked the implications, but would there be any benefit to running this conversion on the whole record at the start, so as to bypass having to call this function multiple times?
|
Thanks for the comments, @rhigman ! I unfortunately won't have time to address them. I did read through them and appreciate you taking the time anyway. |
Addresses #27