Skip to content

process IA uploads and deletions from privacy toggle#3775

Open
teovin wants to merge 12 commits into
harvard-lil:developfrom
teovin:privacy-toggle-ia-uploads-deletions
Open

process IA uploads and deletions from privacy toggle#3775
teovin wants to merge 12 commits into
harvard-lil:developfrom
teovin:privacy-toggle-ia-uploads-deletions

Conversation

@teovin
Copy link
Copy Markdown
Contributor

@teovin teovin commented May 20, 2026

This PR allows Perma to sync with Internet Archive when a permanent link’s privacy is toggled via the API.

Private → public: set internet_archive_upload_status to upload_or_reupload_required and enqueue an immediate IA upload if the link is eligible.
Public → private: set internet_archive_upload_status to deletion_required and enqueue an immediate IA deletion if the file is deletable.

Dedicated scheduled celery jobs: Process anything leftover, reusing upload_link_to_internet_archive and delete_link_from_daily_item.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 34.84848% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.66%. Comparing base (db931d6) to head (9756fa6).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
perma_web/perma/celery_tasks.py 22.72% 34 Missing ⚠️
perma_web/perma/models/link.py 18.18% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3775      +/-   ##
===========================================
- Coverage    69.87%   69.66%   -0.21%     
===========================================
  Files           73       73              
  Lines         8029     8083      +54     
===========================================
+ Hits          5610     5631      +21     
- Misses        2419     2452      +33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@teovin teovin requested a review from rebeccacremona May 21, 2026 13:28
Comment thread perma_web/perma/celery_tasks.py Outdated
logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).")


def request_internet_archive_upload_from_privacy_toggle(link):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, I am not fully convinced that this and the below should be in this file. Let me know if you have a better location suggestion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:

    # Get the link and verify that it is eligible for upload
    link = Link.objects.get(guid=link_guid)
    if not link.can_upload_to_internet_archive():
        logger.info(f"Queued Link {link_guid} no longer eligible for upload.")
        return

https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576

Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you are right. Will call upload_link_to_internet_archive.delay() from the view.

Copy link
Copy Markdown
Contributor

@rebeccacremona rebeccacremona left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!! It will be delightful to have this functionality reintroduced, after its long hiatus!

I asked a bunch of design questions; if any don't make sense just let me know!

def queue_link_updates_to_ia(first_link_to_queue, to_queue, task):
"""
Queue tasks to the Internet Archive.
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huzzah for spotting similar patterns, and not wanting to duplicate logic. I'm not sure the ergonomics quite hit the sweet spot yet.

  • There is still a fair bit of duplication in the three spots that call it: construct a query, pull the first object of the iterator, check for existence, log if not, proceed if yes, and finally, log about the work completed. I think it would be better to either craft a helper that extracts more of the common pattern out into something reusable, or instead to just keep everything inline and live with the similarities; I'm not sure that just abstracting out this piece improves readability.

  • I observe that this is in fact a generic function for invoking celery tasks that take link.guid as a single arg, rather than a function for queuing internet archive tasks: you could pass it any task, and it would work. So, even if kept, it probably wants renaming.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I pulled a few more of the duplicated logic into this helper, and also renamed it.

Comment thread perma_web/perma/celery_tasks.py Outdated
logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).")


def request_internet_archive_upload_from_privacy_toggle(link):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:

    # Get the link and verify that it is eligible for upload
    link = Link.objects.get(guid=link_guid)
    if not link.can_upload_to_internet_archive():
        logger.info(f"Queued Link {link_guid} no longer eligible for upload.")
        return

https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576

Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?

Comment thread perma_web/perma/celery_tasks.py Outdated
Queue upload tasks for permanent links marked upload_or_reupload_required that are
currently eligible for IA but do not already have an in-flight or completed daily file.
Backfills links that became public after their creation-date daily item was marked
complete, or that were not yet playable when first marked for upload.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or that were not yet playable when first marked for upload.

I'm having trouble imagining when that happens. Can you help remind me?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that when the privacy is toggled before the cache playback status celery job is run, and the link's playback status is not true, which would make it ineligible to upload to IA at the time of the toggle. Correct me if I am missing anything, which is very likely.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, then why mark it for upload? It seems to me to make more sense to only mark links that are in fact eligible for upload "upload_or_reupload_required", rather than labeling it "upload_or_reupload_required" and then having another function say, oh, just kidding, it's just a baby link... and then have the link in question retain that status.

'queue_internet_archive_deletions_required_from_privacy_toggle': {
'task': 'perma.celery_tasks.queue_internet_archive_deletions_required_from_privacy_toggle',
'schedule': crontab(minute="20,50"),
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, following up on in person discussion, can you say a word about how you picked those times for the crons to run? It might be that we don't have to think about it too hard, but would love to hear your reasoning, as we reason through how this may or may not affect other IA-related scheduling.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I expected the volume of these tasks to be pretty low, but still wanted to have a close enough to being real-time as opposed to once every 24 hrs for example.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Okay, cool, so basically "twice an hour, and I just picked times".

We can have a look in grafana to see when the IA read-only queue is most often empty, and make sure that this task will run then. Not required, but kind of nice: it will be nice to see it run when scheduled, and not have it stuck behind a queue of other tasks.

Comment thread perma_web/perma/models/link.py

if limit is not None:
query = query[:limit]
return query
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple things I am thinking about here.

One is, database indexes, and whether we have the ones we need to make these queries doable.

Another is, whether checking permanent and visible_to_ia is correct to check here. Two thought come to mind.

One is, I think it would be incorrect for a link that should not be uploaded to IA, because it is less than a day old (==not permanent) or shouldn't be visible to IA, to get marked upload_or_reupload_required in the first place. I feel like there should be a check at the moment we are considering marking it that, and then the sanity-ensuring, already-existing check in the upload task, that makes extra sure before it does anything. I worry that checking here makes this query more expensive, without much benefit. Do you have thoughts about that?

Similarly, I'm not sure about checking here about whether a link is safe to delete is right. So much can change between the time a task is queued and the time it is executed: it seems safer and more efficient to me to check right when attempting the deletion... and, possibly, when applying the deletion_required label. (Though, no harm if there is in fact nothing to delete, that would just be a no-op).

Do you have thoughts on tightening this? Do you think my concerns are correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we have an index on the internet_archive_upload_status field. Also, you are right, and we are checking for permanent in the link view before updating the record's internet_archive_upload_status, and also checking for visible_to_ia in upload_link_to_internet_archive. So I removed those.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other internet archive-related work, we have needed composite indicies, like these: https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/models/link.py#L175C1-L176C161

I can help check too, once you are done editing, but my suspicion is that a single index on that one field would not be used, or would not be enough to prevent the db from doing a fair amount of work. Do you know how to check that? I have a few techniques, none of which are totally zippy, but which are easier to do now with a little help from Claude, filling up one's local database with thousands of rows of realistic data lol. Happy to describe!

link.save(update_fields=["internet_archive_upload_status"])
link.save(update_fields=["internet_archive_upload_status"])
logger.info(f"Link {link.guid} was toggled to private. Requesting the IA deletion.")
request_internet_archive_deletion_from_privacy_toggle(link)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I recall correctly that, in production, you saw no links with deletion_required or upload_or_reupload_required? If that's the case, we should make sure these branches are running as expected in prod...

That also suggests we will need to run something to re-populate those fields with accurate info.

Also: I think there is no code that unsets internet_archive_upload_status after the upload or deletion is completed. Is that right? Unless I'm mistaken, I think that means that as designed, that the queryset of links that we look at during the queuing stage gets bigger and bigger over time, and then we filter out by links with an IAFile with status completed. Am I right to be concerned about efficiency here? Should we change internet_archive_upload_status? I note that is an old field, designed for the original, one-link-per-IA-item setup that we are migrating away from, so: if the new design needs something better, we can introduce a new field (or fields) instead of this one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am taking my word back. It looks like previously I checked the ia file model's status field instead of link's ia upload status field (facepalm). I rechecked, and we have 7376 records with either of the statuses.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about your unsetting the status column comment, which is very legit. Maybe setting those in the confirmation tasks? completed for uploaded ones, deleted for deleted ones.. Still thinking...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have 7376 records with either of the statuses.
That's great news!! I'm so glad lol.

That's also useful info: we can check the dates (with some thinking about... which dates are relevant...) and find out how many links have been toggled in recent years!

Copy link
Copy Markdown
Contributor

@rebeccacremona rebeccacremona May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe setting those in the confirmation tasks?... till thinking...

How about, in the confirmation task, setting that status field to None. I think we should probably change this field, so that it is nullable and the only allowed values are upload_or_reupload_required and deletion_required: all the other values only map to realities of the old process.

I will also think about other possibilities 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the above, but let me know if I jumped the gun..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants