Add utilities to detect and replace broken links. #1366
barsh404error wants to merge 8 commits into Together-Java:develop
@tj-wazei I've addressed your requests and fixed them.
Zabuzard left a comment:
The added methods are utility methods, and for those it is crucial that they have proper Javadoc explaining what they do and giving examples.
The isLinkBroken method needs to explain what it means for a link to be broken and how it behaves in edge cases, for example if the given text isn't a URL.
The other method needs to explain in more detail what the two parameters are and how they work, maybe giving a concrete example in the Javadoc.
Yes, I got that solved and made Javadoc for
        int status = response.statusCode();
        return status < 200 || status >= 400;
    })
    .exceptionally(ignored -> true)
the idiomatic name for something you ignore is _
Renamed ignored lambda parameters to _ where applicable (e.g. exceptionally(_ -> ...), thenApply(_ -> ...)) to clearly indicate intentional non-usage.
you forgot one:
.exceptionally(ignored -> true); // still never null
List<CompletableFuture<String>> deadLinkFutures = links.stream()
    .distinct()
    .map(link -> isLinkBroken(link)
        .thenApply(isBroken -> Boolean.TRUE.equals(isBroken) ? link : null))
better: instead of map(foo -> ...thenApply(...)), use .map(foo -> ...).filter(...). then you also don't have all these null items in your list, polluting it
return CompletableFuture.allOf(deadLinkFutures.toArray(new CompletableFuture[0]))
    .thenApply(ignored -> deadLinkFutures.stream()
        .map(CompletableFuture::join)
        .filter(Objects::nonNull)
.filter(Objects::nonNull): that one isn't needed anymore with the above fix
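One possible reading of the suggestion above, as a minimal sketch rather than the code that ended up in the PR: pair each link with its check result so that no null placeholders are ever created, then filter on the result once all checks have completed. The class name DeadLinkCollector, the method findDeadLinks, and the injected isLinkBroken function are hypothetical and exist only to keep the example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DeadLinkCollector {
    private DeadLinkCollector() {}

    // isLinkBroken is passed in (an assumption of this sketch) so the logic runs without network access.
    static CompletableFuture<List<String>> findDeadLinks(List<String> links,
            Function<String, CompletableFuture<Boolean>> isLinkBroken) {
        // Keep the link next to its result instead of mapping working links to null.
        List<CompletableFuture<Map.Entry<String, Boolean>>> checks = links.stream()
            .distinct()
            .map(link -> isLinkBroken.apply(link)
                .thenApply(isBroken -> Map.entry(link, Boolean.TRUE.equals(isBroken))))
            .collect(Collectors.toList());

        // allOf only signals completion; join() below cannot block because every future is done.
        return CompletableFuture.allOf(checks.toArray(new CompletableFuture[0]))
            .thenApply(ignored -> checks.stream()
                .map(CompletableFuture::join)
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList()));
    }
}
```

With this shape the Objects::nonNull filter has nothing left to remove, which matches the follow-up remark in the review.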
I focused primarily on improving the Javadocs as requested, @Zabuzard.
Thanks for the detailed review, @Zabuzard. I went through the comments one by one and addressed them explicitly:

Javadoc & behavior clarification
- Expanded the Javadoc for isLinkBroken to clearly define what “broken” means:
  - HTTP request failure
  - HTTP status codes outside the 200–399 range
- Clarified that:
  - Status 200 is considered valid even with an empty body
  - Response body content is not inspected
  - Invalid URL formats result in an IllegalArgumentException

HEAD / GET request logic
- HEAD is used as an initial, cheaper check
- If HEAD indicates failure, a GET request is used as a fallback to handle servers that don’t implement HEAD correctly
- Any exception during either request is treated as a broken link

Naming improvements
- Renamed request variables to describe intent rather than HTTP mechanics (e.g. link status validation rather than fallback flow)

Asynchronous behavior
- replaceDeadLinks performs all link checks asynchronously
- CompletableFuture.allOf(...) is used only as a synchronization point; no blocking is introduced
- thenApply(_ -> …) is intentionally used to ignore the unused result and avoid misleading variable names

replaceDeadLinks behavior
- Added a concrete usage example to the Javadoc showing the input → output transformation
- Only links detected as broken are replaced
- Working links and non-URL content remain unchanged
- Duplicate links are checked once and replaced consistently

Edge cases
- Empty input or no detected links returns the original text
- Empty response bodies are not treated as broken
- Exceptions during HTTP checks are handled defensively and result in “broken”

Sorry for no longer answering those comments one by one; it was annoying and frustrating lol
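The HEAD-then-GET flow described above could look roughly like the following. This is a hedged sketch, not the PR's actual implementation: the class HeadThenGetCheck and the helpers checkWithFallback, brokenOrFailed, and sendForStatus are hypothetical names, and the status rule is the "outside 200–399" definition from the comment above. Only standard java.net.http calls are used.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class HeadThenGetCheck {
    private HeadThenGetCheck() {}

    // "Broken" here means a status outside 200-399, as stated in the comment above.
    static boolean isBrokenStatus(int status) {
        return status < 200 || status >= 400;
    }

    // Runs the cheap check first; the expensive one is only started if the first reports failure.
    static CompletableFuture<Boolean> checkWithFallback(
            Supplier<CompletableFuture<Integer>> headStatus,
            Supplier<CompletableFuture<Integer>> getStatus) {
        return brokenOrFailed(headStatus.get()).thenCompose(headBroken -> {
            if (!headBroken) {
                return CompletableFuture.completedFuture(false);
            }
            return brokenOrFailed(getStatus.get());
        });
    }

    // Any exception during the request is treated as a broken link.
    private static CompletableFuture<Boolean> brokenOrFailed(CompletableFuture<Integer> status) {
        return status.thenApply(HeadThenGetCheck::isBrokenStatus)
            .exceptionally(ignored -> true);
    }

    // Wiring with the JDK HTTP client; URI.create throws IllegalArgumentException on malformed input.
    static CompletableFuture<Boolean> isLinkBroken(HttpClient client, String url) {
        URI uri = URI.create(url);
        HttpRequest head = HttpRequest.newBuilder(uri)
            .method("HEAD", HttpRequest.BodyPublishers.noBody())
            .build();
        HttpRequest get = HttpRequest.newBuilder(uri).GET().build();
        return checkWithFallback(() -> sendForStatus(client, head), () -> sendForStatus(client, get));
    }

    // The response body is discarded, so its content is never inspected.
    private static CompletableFuture<Integer> sendForStatus(HttpClient client, HttpRequest request) {
        return client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
            .thenApply(HttpResponse::statusCode);
    }
}
```

Passing the two requests as suppliers keeps the fallback decision separate from the HTTP calls, so the decision logic can be exercised with canned status codes instead of real requests.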
 * @return a future containing the modified text
 */
public static CompletableFuture<String> replaceDeadLinks(String text, String replacement) {
List<CompletableFuture<String>> deadLinkFutures = links.stream()
    .distinct()
    .map(link -> isLinkBroken(link)
        .thenApply(isBroken -> Boolean.TRUE.equals(isBroken) ? link : null))
return CompletableFuture.allOf(deadLinkFutures.toArray(new CompletableFuture[0]))
    .thenApply(ignored -> deadLinkFutures.stream()
        .map(CompletableFuture::join)
        .filter(Objects::nonNull)
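The hunk above stops before the step that rewrites the text. A minimal sketch of what that last step might look like, assuming the dead links have already been collected, is shown below; DeadLinkReplacer and replaceLinks are hypothetical names and this is not taken from the PR.

```java
import java.util.Collection;

final class DeadLinkReplacer {
    private DeadLinkReplacer() {}

    // Replaces every occurrence of each dead link; everything else in the text is left untouched.
    static String replaceLinks(String text, Collection<String> deadLinks, String replacement) {
        String result = text;
        for (String deadLink : deadLinks) {
            // String.replace substitutes all literal occurrences, so duplicate links stay consistent.
            result = result.replace(deadLink, replacement);
        }
        return result;
    }
}
```

Note that plain substring replacement has a limitation: if a dead link is a prefix of a working one (for example a dead site root and a live sub-page), the working link would be altered too. A real implementation may need to replace by detected position instead.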
1d41d50 to d556a33
Updated isLinkBroken() to treat only 4xx/5xx status codes as broken. Previously, 3xx redirects were incorrectly marked as broken links. Also improved Javadoc clarity throughout the LinkDetection class.
Hello @Zabuzard, in this commit I focused mostly on the Javadocs and updated
 * <p>
 * These filters intentionally ignore:
 * <ul>
 * <li>Suppressed links like {@code <https://example.com>}</li>
 * <li>Non-HTTP(S) schemes such as {@code ftp://} or {@code file://}</li>
 * </ul>
 *
 * @see LinkDetection
 * <p>
 * This reduces false positives when scanning chat messages or source-code snippets.
 */
private static final Set<LinkFilter> DEFAULT_FILTERS =
        Set.of(LinkFilter.SUPPRESSED, LinkFilter.NON_HTTP_SCHEME);
don't list the current content of the filters. the problem with such a doc is that it easily rots.
like, in two years someone will have added a new filter and removed another filter, without adjusting the javadoc... oops.
it's great that you have explained the individual filters a few lines below, that's enough. for the DEFAULT_FILTERS i would instead just describe:
Links to intentionally ignore in order to reduce false positives when scanning chat messages or source-code snippets.
/**
 * Extracts all links from the given content.
 * Extracts HTTP(S) links from the given text.
too specific. this assumes that the link filter remains set up as it currently is. what if in the future it also runs on ftp links? then this javadoc will definitely be forgotten and end up wrong.
just keep it as
Extracts links from the given text
public static List<String> extractLinks(String content, Set<LinkFilter> filter) {
    return new UrlDetector(content, UrlDetectorOptions.BRACKET_MATCH).detect()
        .stream()
make an overload of this method that uses the default filter right away.
public static List<String> extractLinks(String content) {
    return extractLinks(content, DEFAULT_LINK_FILTER);
}

 * <p>
 * The check is performed in two steps:
 * <ol>
 * <li>A {@code HEAD} request is sent first (cheap and fast)</li>
 * <li>If that fails or returns an error, a {@code GET} request is used as a fallback</li>
 * </ol>
implementation detail. irrelevant for the user of the method. relevant for the maintainers of the code. so move this part as comment into the method instead.
 * <p>
 * Applies the provided {@link LinkFilter}s:
 * <ul>
 * <li>{@link LinkFilter#SUPPRESSED} - filters URLs wrapped in angle brackets</li>
 * <li>{@link LinkFilter#NON_HTTP_SCHEME} - filters non-HTTP(S) schemes</li>
 * </ul>
this assumes the content of the filter to remain the same over the years, don't do that. instead just link the variable.
in fact, the doc is already incorrect, as it currently does NOT apply these filters. it applies whatever the user gives as parameter.
Thanks a lot to @christolis for helping me out with this pull request.
Added two utility methods, isLinkBroken and replaceDeadLinks:
- isLinkBroken(String url) checks link availability using a HEAD request. I used a HEAD request instead of a GET request to check availability without downloading the response body, which reduces bandwidth and improves performance.
- replaceDeadLinks(String text, String replacement) replaces unreachable/broken links asynchronously.
This change does not alter the behavior of any existing code.
Part of #1276; implements the mentioned utility but doesn't apply it.