process IA uploads and deletions from privacy toggle#3775
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3775 +/- ##
===========================================
- Coverage 69.87% 69.66% -0.21%
===========================================
Files 73 73
Lines 8029 8083 +54
===========================================
+ Hits 5610 5631 +21
- Misses 2419 2452 +33 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).") | ||
|
|
||
|
|
||
| def request_internet_archive_upload_from_privacy_toggle(link): |
There was a problem hiding this comment.
Argh, I am not fully convinced that this and the below should be in this file. Let me know if you have a better location suggestion.
There was a problem hiding this comment.
Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:
# Get the link and verify that it is eligible for upload
link = Link.objects.get(guid=link_guid)
if not link.can_upload_to_internet_archive():
logger.info(f"Queued Link {link_guid} no longer eligible for upload.")
return
https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576
Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?
There was a problem hiding this comment.
Yep, you are right. Will call upload_link_to_internet_archive.delay() from the view.
rebeccacremona
left a comment
There was a problem hiding this comment.
Nice work!! It will be delightful to have this functionality reintroduced, after its long hiatus!
I asked a bunch of design questions; if any don't make sense just let me know!
| def queue_link_updates_to_ia(first_link_to_queue, to_queue, task): | ||
| """ | ||
| Queue tasks to the Internet Archive. | ||
| """ |
There was a problem hiding this comment.
Huzzah for spotting similar patterns, and not wanting to duplicate logic. I'm not sure the ergonomics quite hit the sweet spot yet.
-
There is still a fair bit of duplication in the three spots that call it: construct a query, pull the first object of the iterator, check for existence, log if not, proceed if yes, and finally, log about the work completed. I think it would be better to either craft a helper that extracts more of the common pattern out into something reusable, or instead to just keep everything inline and live with the similarities; I'm not sure that just abstracting out this piece improves readability.
-
I observe that this is in fact a generic function for invoking celery tasks that take
link.guidas a single arg, rather than a function for queuing internet archive tasks: you could pass it any task, and it would work. So, even if kept, it probably wants renaming.
There was a problem hiding this comment.
Agreed! I pulled a few more of the duplicated logic into this helper, and also renamed it.
| logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).") | ||
|
|
||
|
|
||
| def request_internet_archive_upload_from_privacy_toggle(link): |
There was a problem hiding this comment.
Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:
# Get the link and verify that it is eligible for upload
link = Link.objects.get(guid=link_guid)
if not link.can_upload_to_internet_archive():
logger.info(f"Queued Link {link_guid} no longer eligible for upload.")
return
https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576
Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?
| Queue upload tasks for permanent links marked upload_or_reupload_required that are | ||
| currently eligible for IA but do not already have an in-flight or completed daily file. | ||
| Backfills links that became public after their creation-date daily item was marked | ||
| complete, or that were not yet playable when first marked for upload. |
There was a problem hiding this comment.
or that were not yet playable when first marked for upload.
I'm having trouble imagining when that happens. Can you help remind me?
There was a problem hiding this comment.
I am thinking that when the privacy is toggled before the cache playback status celery job is run, and the link's playback status is not true, which would make it ineligible to upload to IA at the time of the toggle. Correct me if I am missing anything, which is very likely.
There was a problem hiding this comment.
If that is the case, then why mark it for upload? It seems to me to make more sense to only mark links that are in fact eligible for upload "upload_or_reupload_required", rather than labeling it "upload_or_reupload_required" and then having another function say, oh, just kidding, it's just a baby link... and then have the link in question retain that status.
| 'queue_internet_archive_deletions_required_from_privacy_toggle': { | ||
| 'task': 'perma.celery_tasks.queue_internet_archive_deletions_required_from_privacy_toggle', | ||
| 'schedule': crontab(minute="20,50"), | ||
| }, |
There was a problem hiding this comment.
Right, following up on in person discussion, can you say a word about how you picked those times for the crons to run? It might be that we don't have to think about it too hard, but would love to hear your reasoning, as we reason through how this may or may not affect other IA-related scheduling.
There was a problem hiding this comment.
Hmm, I expected the volume of these tasks to be pretty low, but still wanted to have a close enough to being real-time as opposed to once every 24 hrs for example.
There was a problem hiding this comment.
👍 Okay, cool, so basically "twice an hour, and I just picked times".
We can have a look in grafana to see when the IA read-only queue is most often empty, and make sure that this task will run then. Not required, but kind of nice: it will be nice to see it run when scheduled, and not have it stuck behind a queue of other tasks.
|
|
||
| if limit is not None: | ||
| query = query[:limit] | ||
| return query |
There was a problem hiding this comment.
There are a couple things I am thinking about here.
One is, database indexes, and whether we have the ones we need to make these queries doable.
Another is, whether checking permanent and visible_to_ia is correct to check here. Two thought come to mind.
One is, I think it would be incorrect for a link that should not be uploaded to IA, because it is less than a day old (==not permanent) or shouldn't be visible to IA, to get marked upload_or_reupload_required in the first place. I feel like there should be a check at the moment we are considering marking it that, and then the sanity-ensuring, already-existing check in the upload task, that makes extra sure before it does anything. I worry that checking here makes this query more expensive, without much benefit. Do you have thoughts about that?
Similarly, I'm not sure about checking here about whether a link is safe to delete is right. So much can change between the time a task is queued and the time it is executed: it seems safer and more efficient to me to check right when attempting the deletion... and, possibly, when applying the deletion_required label. (Though, no harm if there is in fact nothing to delete, that would just be a no-op).
Do you have thoughts on tightening this? Do you think my concerns are correct?
There was a problem hiding this comment.
It looks like we have an index on the internet_archive_upload_status field. Also, you are right, and we are checking for permanent in the link view before updating the record's internet_archive_upload_status, and also checking for visible_to_ia in upload_link_to_internet_archive. So I removed those.
There was a problem hiding this comment.
For other internet archive-related work, we have needed composite indicies, like these: https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/models/link.py#L175C1-L176C161
I can help check too, once you are done editing, but my suspicion is that a single index on that one field would not be used, or would not be enough to prevent the db from doing a fair amount of work. Do you know how to check that? I have a few techniques, none of which are totally zippy, but which are easier to do now with a little help from Claude, filling up one's local database with thousands of rows of realistic data lol. Happy to describe!
| link.save(update_fields=["internet_archive_upload_status"]) | ||
| link.save(update_fields=["internet_archive_upload_status"]) | ||
| logger.info(f"Link {link.guid} was toggled to private. Requesting the IA deletion.") | ||
| request_internet_archive_deletion_from_privacy_toggle(link) |
There was a problem hiding this comment.
Do I recall correctly that, in production, you saw no links with deletion_required or upload_or_reupload_required? If that's the case, we should make sure these branches are running as expected in prod...
That also suggests we will need to run something to re-populate those fields with accurate info.
Also: I think there is no code that unsets internet_archive_upload_status after the upload or deletion is completed. Is that right? Unless I'm mistaken, I think that means that as designed, that the queryset of links that we look at during the queuing stage gets bigger and bigger over time, and then we filter out by links with an IAFile with status completed. Am I right to be concerned about efficiency here? Should we change internet_archive_upload_status? I note that is an old field, designed for the original, one-link-per-IA-item setup that we are migrating away from, so: if the new design needs something better, we can introduce a new field (or fields) instead of this one.
There was a problem hiding this comment.
I am taking my word back. It looks like previously I checked the ia file model's status field instead of link's ia upload status field (facepalm). I rechecked, and we have 7376 records with either of the statuses.
There was a problem hiding this comment.
I am thinking about your unsetting the status column comment, which is very legit. Maybe setting those in the confirmation tasks? completed for uploaded ones, deleted for deleted ones.. Still thinking...
There was a problem hiding this comment.
we have 7376 records with either of the statuses.
That's great news!! I'm so glad lol.
That's also useful info: we can check the dates (with some thinking about... which dates are relevant...) and find out how many links have been toggled in recent years!
There was a problem hiding this comment.
Maybe setting those in the confirmation tasks?... till thinking...
How about, in the confirmation task, setting that status field to None. I think we should probably change this field, so that it is nullable and the only allowed values are upload_or_reupload_required and deletion_required: all the other values only map to realities of the old process.
I will also think about other possibilities 🤔
There was a problem hiding this comment.
I added the above, but let me know if I jumped the gun..
This PR allows Perma to sync with Internet Archive when a permanent link’s privacy is toggled via the API.
Private → public: set
internet_archive_upload_statustoupload_or_reupload_requiredand enqueue an immediate IA upload if the link is eligible.Public → private: set
internet_archive_upload_statustodeletion_requiredand enqueue an immediate IA deletion if the file is deletable.Dedicated scheduled celery jobs: Process anything leftover, reusing
upload_link_to_internet_archiveanddelete_link_from_daily_item.