process IA uploads and deletions from privacy toggle by teovin · Pull Request #3775 · harvard-lil/perma

teovin · 2026-05-20T19:42:55Z

This PR allows Perma to sync with Internet Archive when a permanent link’s privacy is toggled via the API.

Private → public: set internet_archive_upload_status to upload_or_reupload_required and enqueue an immediate IA upload if the link is eligible.
Public → private: set internet_archive_upload_status to deletion_required and enqueue an immediate IA deletion if the file is deletable.

Dedicated scheduled celery jobs: Process anything leftover, reusing upload_link_to_internet_archive and delete_link_from_daily_item.

codecov · 2026-05-20T19:52:11Z

Codecov Report

❌ Patch coverage is 34.84848% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.66%. Comparing base (db931d6) to head (9756fa6).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
perma_web/perma/celery_tasks.py	22.72%	34 Missing ⚠️
perma_web/perma/models/link.py	18.18%	9 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3775      +/-   ##
===========================================
- Coverage    69.87%   69.66%   -0.21%     
===========================================
  Files           73       73              
  Lines         8029     8083      +54     
===========================================
+ Hits          5610     5631      +21     
- Misses        2419     2452      +33

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

teovin · 2026-05-22T15:35:33Z

    logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).")


+def request_internet_archive_upload_from_privacy_toggle(link):


Argh, I am not fully convinced that this and the below should be in this file. Let me know if you have a better location suggestion.

Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:

# Get the link and verify that it is eligible for upload link = Link.objects.get(guid=link_guid) if not link.can_upload_to_internet_archive(): logger.info(f"Queued Link {link_guid} no longer eligible for upload.") return

https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576

Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?

Yep, you are right. Will call upload_link_to_internet_archive.delay() from the view.

rebeccacremona

Nice work!! It will be delightful to have this functionality reintroduced, after its long hiatus!

I asked a bunch of design questions; if any don't make sense just let me know!

rebeccacremona · 2026-05-27T20:03:29Z

+def queue_link_updates_to_ia(first_link_to_queue, to_queue, task):
+    """
+    Queue tasks to the Internet Archive.
+    """


Huzzah for spotting similar patterns, and not wanting to duplicate logic. I'm not sure the ergonomics quite hit the sweet spot yet.

There is still a fair bit of duplication in the three spots that call it: construct a query, pull the first object of the iterator, check for existence, log if not, proceed if yes, and finally, log about the work completed. I think it would be better to either craft a helper that extracts more of the common pattern out into something reusable, or instead to just keep everything inline and live with the similarities; I'm not sure that just abstracting out this piece improves readability.

I observe that this is in fact a generic function for invoking celery tasks that take link.guid as a single arg, rather than a function for queuing internet archive tasks: you could pass it any task, and it would work. So, even if kept, it probably wants renaming.

Agreed! I pulled a few more of the duplicated logic into this helper, and also renamed it.

rebeccacremona · 2026-05-27T20:06:57Z

    logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).")


+def request_internet_archive_upload_from_privacy_toggle(link):


Hmmm, I think we can do without this helper at all: the toggle logic can call upload_link_to_internet_archive.delay(link.guid) directly, since the first thing that upload_link_to_internet_archive does is:

# Get the link and verify that it is eligible for upload link = Link.objects.get(guid=link_guid) if not link.can_upload_to_internet_archive(): logger.info(f"Queued Link {link_guid} no longer eligible for upload.") return

https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/celery_tasks.py#L576

Do you agree that this check is redundant, or am I overlooking a subtlety (definitely a possibility)?

rebeccacremona · 2026-05-27T20:09:23Z

+    Queue upload tasks for permanent links marked upload_or_reupload_required that are
+    currently eligible for IA but do not already have an in-flight or completed daily file.
+    Backfills links that became public after their creation-date daily item was marked
+    complete, or that were not yet playable when first marked for upload.


or that were not yet playable when first marked for upload.

I'm having trouble imagining when that happens. Can you help remind me?

I am thinking that when the privacy is toggled before the cache playback status celery job is run, and the link's playback status is not true, which would make it ineligible to upload to IA at the time of the toggle. Correct me if I am missing anything, which is very likely.

If that is the case, then why mark it for upload? It seems to me to make more sense to only mark links that are in fact eligible for upload "upload_or_reupload_required", rather than labeling it "upload_or_reupload_required" and then having another function say, oh, just kidding, it's just a baby link... and then have the link in question retain that status.

rebeccacremona · 2026-05-27T20:22:52Z

+        'queue_internet_archive_deletions_required_from_privacy_toggle': {
+            'task': 'perma.celery_tasks.queue_internet_archive_deletions_required_from_privacy_toggle',
+            'schedule': crontab(minute="20,50"),
+        },


Right, following up on in person discussion, can you say a word about how you picked those times for the crons to run? It might be that we don't have to think about it too hard, but would love to hear your reasoning, as we reason through how this may or may not affect other IA-related scheduling.

Hmm, I expected the volume of these tasks to be pretty low, but still wanted to have a close enough to being real-time as opposed to once every 24 hrs for example.

👍 Okay, cool, so basically "twice an hour, and I just picked times".

We can have a look in grafana to see when the IA read-only queue is most often empty, and make sure that this task will run then. Not required, but kind of nice: it will be nice to see it run when scheduled, and not have it stuck behind a queue of other tasks.

rebeccacremona · 2026-05-27T20:53:03Z

+
+        if limit is not None:
+            query = query[:limit]
+        return query


There are a couple things I am thinking about here.

One is, database indexes, and whether we have the ones we need to make these queries doable.

Another is, whether checking permanent and visible_to_ia is correct to check here. Two thought come to mind.

One is, I think it would be incorrect for a link that should not be uploaded to IA, because it is less than a day old (==not permanent) or shouldn't be visible to IA, to get marked upload_or_reupload_required in the first place. I feel like there should be a check at the moment we are considering marking it that, and then the sanity-ensuring, already-existing check in the upload task, that makes extra sure before it does anything. I worry that checking here makes this query more expensive, without much benefit. Do you have thoughts about that?

Similarly, I'm not sure about checking here about whether a link is safe to delete is right. So much can change between the time a task is queued and the time it is executed: it seems safer and more efficient to me to check right when attempting the deletion... and, possibly, when applying the deletion_required label. (Though, no harm if there is in fact nothing to delete, that would just be a no-op).

Do you have thoughts on tightening this? Do you think my concerns are correct?

It looks like we have an index on the internet_archive_upload_status field. Also, you are right, and we are checking for permanent in the link view before updating the record's internet_archive_upload_status, and also checking for visible_to_ia in upload_link_to_internet_archive. So I removed those.

For other internet archive-related work, we have needed composite indicies, like these: https://github.com/harvard-lil/perma/blob/develop/perma_web/perma/models/link.py#L175C1-L176C161

I can help check too, once you are done editing, but my suspicion is that a single index on that one field would not be used, or would not be enough to prevent the db from doing a fair amount of work. Do you know how to check that? I have a few techniques, none of which are totally zippy, but which are easier to do now with a little help from Claude, filling up one's local database with thousands of rows of realistic data lol. Happy to describe!

rebeccacremona · 2026-05-27T20:56:01Z

-                link.save(update_fields=["internet_archive_upload_status"])
+                    link.save(update_fields=["internet_archive_upload_status"])
+                    logger.info(f"Link {link.guid} was toggled to private. Requesting the IA deletion.")
+                    request_internet_archive_deletion_from_privacy_toggle(link)


Do I recall correctly that, in production, you saw no links with deletion_required or upload_or_reupload_required? If that's the case, we should make sure these branches are running as expected in prod...

That also suggests we will need to run something to re-populate those fields with accurate info.

Also: I think there is no code that unsets internet_archive_upload_status after the upload or deletion is completed. Is that right? Unless I'm mistaken, I think that means that as designed, that the queryset of links that we look at during the queuing stage gets bigger and bigger over time, and then we filter out by links with an IAFile with status completed. Am I right to be concerned about efficiency here? Should we change internet_archive_upload_status? I note that is an old field, designed for the original, one-link-per-IA-item setup that we are migrating away from, so: if the new design needs something better, we can introduce a new field (or fields) instead of this one.

I am taking my word back. It looks like previously I checked the ia file model's status field instead of link's ia upload status field (facepalm). I rechecked, and we have 7376 records with either of the statuses.

I am thinking about your unsetting the status column comment, which is very legit. Maybe setting those in the confirmation tasks? completed for uploaded ones, deleted for deleted ones.. Still thinking...

we have 7376 records with either of the statuses.
That's great news!! I'm so glad lol.

That's also useful info: we can check the dates (with some thinking about... which dates are relevant...) and find out how many links have been toggled in recent years!

Maybe setting those in the confirmation tasks?... till thinking...

How about, in the confirmation task, setting that status field to None. I think we should probably change this field, so that it is nullable and the only allowed values are upload_or_reupload_required and deletion_required: all the other values only map to realities of the old process.

I will also think about other possibilities 🤔

I added the above, but let me know if I jumped the gun..

process ia uploads and deletions from privacy toggle

74da194

teovin requested a review from rebeccacremona May 21, 2026 13:28

teovin added 5 commits May 22, 2026 08:22

save the link before ia calls

ed39471

move task calls to helper

0ab7293

account for limit 0, otherwise query will grab all

a48e73c

move things around, dry a bit

efbc4d8

minor fix

f4d1c71

teovin commented May 22, 2026

View reviewed changes

fix merge conflict

b4ea927

rebeccacremona reviewed May 27, 2026

View reviewed changes

teovin added 4 commits May 28, 2026 11:53

remove redundant helper

3cfb54f

dont update the queue_link_updates_to_ia yet

aa456bc

pull more duplicated logic into task dispatcher

116f2ed

remove unneeded query items

346be6c

rebeccacremona added the no-nudge label May 29, 2026

update ia status field when update is confirmed

9756fa6

		logger.info(f"Queued { len(queued) } links for deletion ({queued[0]} through {queued[-1]}).")


		def request_internet_archive_upload_from_privacy_toggle(link):

Conversation

teovin commented May 20, 2026

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rebeccacremona left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rebeccacremona May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 20, 2026 •

edited

Loading

rebeccacremona May 29, 2026 •

edited

Loading