Skip to content

Add utilities to detect and replace broken links.#1366

Open
barsh404error wants to merge 8 commits intoTogether-Java:developfrom
barsh404error:broken-links
Open

Add utilities to detect and replace broken links.#1366
barsh404error wants to merge 8 commits intoTogether-Java:developfrom
barsh404error:broken-links

Conversation

@barsh404error
Copy link

@barsh404error barsh404error commented Dec 30, 2025

Thanks alot to @christolis for helping me out on making this pull request.
Added two utulity methods isLinkBroken and replaceDeadLinks

-isLinkBroken(String url) checks the link availability using a HEAD request
I used HEAD request instead of GET request to check link availability without downloading the response body, reducing bandwidth and improving the performance.

-replaceDeadLinks (String text, String replacement) replaces unreachable/broken links asynchronously.

This change does not have any behavior changes to the existing code.

Part of #1276, implements the mentioned utility but doesnt apply it.

@barsh404error barsh404error requested a review from a team as a code owner December 30, 2025 10:52
@CLAassistant
Copy link

CLAassistant commented Dec 30, 2025

CLA assistant check
All committers have signed the CLA.

@barsh404error
Copy link
Author

@tj-wazei I've added your request and fixed them
I’ve updated the implementation to avoid mutating shared state in async callbacks and now collect the results before doing replacements sequentially.
I also reuse a single HttpClient and added a GET fallback when HEAD is not supported.
Let me know if anything else should be adjusted ^^

@barsh404error barsh404error requested a review from tj-wazei January 2, 2026 10:22
@barsh404error barsh404error changed the title Add utilities to detect and replace broken links. Add utilities to detect and replace broken links. Fixed #1276 Jan 3, 2026
@christolis christolis changed the title Add utilities to detect and replace broken links. Fixed #1276 Add utilities to detect and replace broken links. Jan 4, 2026
@Zabuzard Zabuzard self-requested a review January 4, 2026 15:37
Copy link
Member

@Zabuzard Zabuzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added methods are utility methods and for those it is crucial that they have a proper Javadoc explaining what they do and give examples.

The isLinkBroken method needs to explain what it means for a link to be broken and how it behaves in edge case, for example if the given text isnt a url and so on.

The other method needs to explain in more detail what the two parameters are and how they work, maybe giving a concrete example in the javadoc.

@barsh404error
Copy link
Author

The added methods are utility methods and for those it is crucial that they have a proper Javadoc explaining what they do and give examples.

The isLinkBroken method needs to explain what it means for a link to be broken and how it behaves in edge case, for example if the given text isnt a url and so on.

The other method needs to explain in more detail what the two parameters are and how they work, maybe giving a concrete example in the javadoc.

Yes, i got that solved made javadoc for isLinkBroken and replaceDeadLinks.
i explained what does it do step by step altho idk if i wrote to long

@barsh404error barsh404error requested a review from Zabuzard January 6, 2026 16:30
tj-wazei
tj-wazei previously approved these changes Jan 6, 2026
Copy link
Contributor

@tj-wazei tj-wazei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

int status = response.statusCode();
return status < 200 || status >= 400;
})
.exceptionally(ignored -> true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idiomatic name for something you ignore is _

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed ignored lambda parameters to _ where applicable (e.g. exceptionally(_ -> ...), thenApply(_ -> ...)) to clearly indicate intentional non-usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you forgot one:

.exceptionally(ignored -> true); // still never null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

Comment on lines +150 to +153
List<CompletableFuture<String>> deadLinkFutures = links.stream()
.distinct()
.map(link -> isLinkBroken(link)
.thenApply(isBroken -> Boolean.TRUE.equals(isBroken) ? link : null))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better: instead of map(foo -> ...thenApply(...)), use .map(foo -> ...).filter(...) then you also dont have all these null items in ur list, polluting it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

return CompletableFuture.allOf(deadLinkFutures.toArray(new CompletableFuture[0]))
.thenApply(ignored -> deadLinkFutures.stream()
.map(CompletableFuture::join)
.filter(Objects::nonNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.filter(Objects::nonNull) that one isnt needed anymore with the above fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

@barsh404error
Copy link
Author

I focused primarily on improving the Javadocs as requested @Zabuzard
While doing so, I slightly refactored the internal stream to use Optional instead of null to better match the documented behavior and avoid null handling
No external behavior was changed.

@barsh404error
Copy link
Author

Thanks for the detailed review @Zabuzard I went through the comments one by one and addressed them explicitly:

Javadoc & behavior clarification

Expanded Javadoc for isLinkBroken to clearly define what “broken” means:

HTTP request failure

HTTP status codes outside the 200–399 range

Clarified that:

Status 200 is considered valid even with an empty body

Response body content is not inspected

Invalid URL formats result in an IllegalArgumentException

HEAD / GET request logic

HEAD is used as an initial, cheaper check

If HEAD indicates failure, a GET request is used as a fallback to handle servers that don’t implement HEAD correctly

Any exception during either request is treated as a broken link

Naming improvements

Renamed request variables to describe intent rather than HTTP mechanics (e.g. link status validation rather than fallback flow)

Asynchronous behavior

replaceDeadLinks performs all link checks asynchronously

CompletableFuture.allOf(...) is used only as a synchronization point; no blocking is introduced

thenApply(_ -> …) is intentionally used to ignore the unused result and avoid misleading variable names

replaceDeadLinks behavior

Added a concrete usage example to the Javadoc showing input → output transformation

Only links detected as broken are replaced

Working links and non-URL content remain unchanged

Duplicate links are checked once and replaced consistently

Edge cases

Empty input or no detected links returns the original text

Empty response bodies are not treated as broken

Exceptions during HTTP checks are handled defensively and result in “broken”

sorry for stopping to answer those comments one by one it was annoying and frustrating lol

* @return a future containing the modified text
*/

public static CompletableFuture<String> replaceDeadLinks(String text, String replacement) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

Comment on lines +150 to +153
List<CompletableFuture<String>> deadLinkFutures = links.stream()
.distinct()
.map(link -> isLinkBroken(link)
.thenApply(isBroken -> Boolean.TRUE.equals(isBroken) ? link : null))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

return CompletableFuture.allOf(deadLinkFutures.toArray(new CompletableFuture[0]))
.thenApply(ignored -> deadLinkFutures.stream()
.map(CompletableFuture::join)
.filter(Objects::nonNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done yet

Updated isLinkBroken() to only treat 4xx/5xx status codes as broken.
Previously 3xx redirects were incorrectly marked as broken links

also improved javadoc clarity throughout LinkDetection class
@barsh404error
Copy link
Author

hello, @Zabuzard in this commit i focused mostly on the javadocs and updated isLinkBroken() to only treat 4xx/5xx status codes as broken.
let me know if the javadocs makes sense or not, also with any few changes if your request :]

Comment on lines +39 to +51
* <p>
* These filters intentionally ignore:
* <ul>
* <li>Suppressed links like {@code <https://example.com>}</li>
* <li>Non-HTTP(S) schemes such as {@code ftp://} or {@code file://}</li>
* </ul>
*
* @see LinkDetection
* <p>
* This reduces false positives when scanning chat messages or source-code snippets.
*/

private static final Set<LinkFilter> DEFAULT_FILTERS =
Set.of(LinkFilter.SUPPRESSED, LinkFilter.NON_HTTP_SCHEME);
Copy link
Member

@Zabuzard Zabuzard Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont list the current content of the filters. the problem with such a doc is that it easily rots.

like, in two years someone will have added a new filter and removed another filter, without adjusting the javadoc... oops.

its great that you have explained the individual filters a few lines below, thats enough. for the DEFAULT_FILTERS i would instead just describe:

Links to intentionally ignore in order to reduce false positives when scanning chat messages or source-code snippets.


/**
* Extracts all links from the given content.
* Extracts HTTP(S) links from the given text.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too specific. this assumes that the link-filter remains setup as it currently is. what if in the future its also running on ftp links, then this javadoc will definitely be forgotten to be adjusted and be wrong then.

just keep it as

Extracts links from the given text

Comment on lines 99 to 101
public static List<String> extractLinks(String content, Set<LinkFilter> filter) {
return new UrlDetector(content, UrlDetectorOptions.BRACKET_MATCH).detect()
.stream()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an overload of this method that uses the default filter right away.

public static List<String> extractLinks(String content) {
  return extractLinks(content, DEFAULT_LINK_FILTER);
}

Comment on lines +124 to +129
* <p>
* The check is performed in two steps:
* <ol>
* <li>A {@code HEAD} request is sent first (cheap and fast)</li>
* <li>If that fails or returns an error, a {@code GET} request is used as a fallback</li>
* </ol>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementation detail. irrelevant for the user of the method. relevant for the maintainers of the code. so move this part as comment into the method instead.

Comment on lines +243 to +248
* <p>
* Applies the provided {@link LinkFilter}s:
* <ul>
* <li>{@link LinkFilter#SUPPRESSED} - filters URLs wrapped in angle brackets</li>
* <li>{@link LinkFilter#NON_HTTP_SCHEME} - filters non-HTTP(S) schemes</li>
* </ul>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assumes the content of the filter to remain the same over the years, dont do that. instead just link the variable.

in fact, the doc is already incorrect as it does currently NOT apply these filters. it applies what the user gives as parameter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants