[FLINK-33100] Implement YarnJobListFetcher by Samrat002 · Pull Request #1031 · apache/flink-kubernetes-operator

Samrat002 · 2025-09-22T20:41:56Z

What is the purpose of the change

YarnJobListFetcher Implementation

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., are there any changes to the CustomResourceDescriptors: (yes / no)
Core observer or reconciler logic that is regularly executed: (yes / no)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

1996fanrui

Hey @Samrat002, thanks for picking this up, and sorry for the late review.

I have some questions about this PR. Please take a look when you are available, thanks.

1996fanrui · 2025-10-02T16:22:54Z

+--autoscaler.standalone.fetcher.type FLINK_CLUSTER|YARN
+```
+
+When running against Flink-on-YARN (`YARN`), set the host/port to the YARN web proxy endpoint that exposes the JobManager REST API.


set the host/port to the YARN web proxy endpoint

Do you mean autoscaler.standalone.fetcher.flink-cluster.host and autoscaler.standalone.fetcher.flink-cluster.port?

If yes, that does not make sense, because all config options with the autoscaler.standalone.fetcher.flink-cluster prefix relate to the Flink cluster fetcher. It would be better to introduce YARN-specific config options.

1996fanrui · 2025-10-02T16:28:50Z

+To select the job fetcher use:
+
+```
+--autoscaler.standalone.fetcher.type FLINK_CLUSTER|YARN


How about adding a complete demo for YARN mode?

1996fanrui · 2025-10-02T16:29:37Z

-We will implement `YarnJobListFetcher` in the future, `Flink Autoscaler Standalone` will call 
-`YarnJobListFetcher#fetch` to fetch job list from yarn cluster periodically.
+Currently `FlinkClusterJobListFetcher` and `YarnJobListFetcher` are implementations of the 
+`JobListFetcher` interface. that's why `Flink Autoscaler Standalone` only supports a single Flink cluster so far. 


that's why Flink Autoscaler Standalone only supports a single Flink cluster so far.

It no longer makes sense after adding YARN support, and the sentence should either be removed or rewritten to explain that each fetcher instance still monitors a single cluster or YARN deployment.

1996fanrui · 2025-10-02T16:35:24Z

+            default:
+                return (JobListFetcher<KEY, Context>)
+                        new FlinkClusterJobListFetcher(
+                                clientSupplier, conf.get(FLINK_CLIENT_TIMEOUT));


The default value of AutoscalerStandaloneOptions.FETCHER_TYPE is already FLINK_CLUSTER, so a default case that falls back to FLINK_CLUSTER does not make sense here: it silently accepts invalid configuration values. It is better to throw an exception for unknown fetcher types, which also prevents bugs when a new type is introduced in the future.
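
A minimal sketch of the fail-fast selection suggested above. The class `FetcherSelectionSketch`, the `Fetcher` interface, and both helper methods are hypothetical stand-ins written for illustration; only the enum values FLINK_CLUSTER and YARN come from the PR.

```java
import java.util.Locale;

/** Sketch only: stand-in types, not the operator's real classes. */
class FetcherSelectionSketch {

    /** Mirrors the two values named in the PR's option description. */
    enum FetcherType { FLINK_CLUSTER, YARN }

    /** Hypothetical stand-in for the JobListFetcher interface. */
    interface Fetcher {
        String name();
    }

    /** Every supported type gets an explicit case; anything else fails loudly. */
    static Fetcher create(FetcherType type) {
        switch (type) {
            case FLINK_CLUSTER:
                return () -> "flink-cluster";
            case YARN:
                return () -> "yarn";
            default:
                // Reached only if a new enum constant is added without a case here.
                throw new IllegalArgumentException("Unsupported fetcher type: " + type);
        }
    }

    /** Parses a raw config value; unknown strings fail instead of falling back. */
    static FetcherType parse(String raw) {
        try {
            return FetcherType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown fetcher type: '" + raw + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create(parse("yarn")).name());
    }
}
```

With this shape, a misspelled value surfaces at startup as an exception rather than as a Flink cluster fetcher running against the wrong endpoint.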

1996fanrui · 2025-10-02T16:40:59Z

+    public static final ConfigOption<FetcherType> FETCHER_TYPE =
+            autoscalerStandaloneConfig("fetcher.type")
+                    .enumType(FetcherType.class)
+                    .defaultValue(FetcherType.FLINK_CLUSTER)
+                    .withDescription(
+                            "The job list fetcher type to use. Supported values: FLINK_CLUSTER, YARN.");


https://github.com/apache/flink-kubernetes-operator/blob/main/docs/README.md

Please regenerate the docs as described in this README. Also, IIRC, there is no need to mention the supported values in the description, since the doc tooling lists all enum values by default.

1996fanrui · 2025-10-14T19:58:26Z

+        } catch (Throwable ignore) {
+            // Ignore


This suppresses all exceptions, including critical ones like OutOfMemoryError, without any logging. That makes it impossible to diagnose why YARN-based job discovery failed: we cannot tell whether the cause is a configuration issue, a network problem, or an authentication failure.
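
One possible shape for this, as a sketch rather than the PR's actual code: catch `Exception` instead of `Throwable` so that Errors such as OutOfMemoryError propagate, and log the cause. The class name, the `discover` helper, and the choice to return an empty list on failure are assumptions made for illustration; java.util.logging is used only to keep the example dependency-free.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: hypothetical helper, not code from the PR. */
class DiscoveryErrorHandlingSketch {

    private static final Logger LOG =
            Logger.getLogger(DiscoveryErrorHandlingSketch.class.getName());

    /** Runs a discovery step; logs recoverable failures, lets Errors propagate. */
    static List<String> discover(Callable<List<String>> yarnDiscovery) {
        try {
            return yarnDiscovery.call();
        } catch (Exception e) {
            // Exception, not Throwable: OutOfMemoryError and other Errors are not swallowed.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
            LOG.log(Level.WARNING, "YARN job discovery failed", e);
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        List<String> jobs = discover(() -> {
            throw new java.io.IOException("ResourceManager unreachable");
        });
        System.out.println("discovered jobs: " + jobs.size());
    }
}
```

Whether a failed discovery should yield an empty list or abort the fetch is a separate design decision; the point here is only that the cause ends up in the log.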

1996fanrui · 2025-10-14T20:00:27Z

+            return discovered;
+        }
+
+        // use supplied client factory (may point to direct JM or a reverse proxy)


Why fall back to the JM or Flink cluster here? If that is what the user expects, why would they choose the YARN fetcher instead of the Flink cluster fetcher?

1996fanrui · 2025-10-14T20:03:07Z

+            yarnClient = YarnClient.createYarnClient();
+            org.apache.hadoop.conf.Configuration yarnConf =
+                    new org.apache.hadoop.conf.Configuration();
+            yarnClient.init(yarnConf);
+            yarnClient.start();


I am not sure whether creating a YarnClient without any explicit Hadoop configuration works. Generally, it needs Hadoop configuration files such as core-site.xml or yarn-site.xml, which might be present on the classpath.
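
As a sketch of how the fetcher could check this up front, the helper below looks for yarn-site.xml in the directories named by the YARN_CONF_DIR and HADOOP_CONF_DIR environment variables. It assumes the deployment exposes its Hadoop configuration that way, and it deliberately calls no Hadoop API; the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

/** Sketch only: a pre-flight check, not part of the PR. */
class HadoopConfDirSketch {

    /** Returns the first env-configured directory that contains yarn-site.xml. */
    static Optional<Path> findYarnConfDir(Map<String, String> env) {
        for (String variable : new String[] {"YARN_CONF_DIR", "HADOOP_CONF_DIR"}) {
            String value = env.get(variable);
            if (value == null || value.isEmpty()) {
                continue;
            }
            Path dir = Paths.get(value);
            if (Files.isRegularFile(dir.resolve("yarn-site.xml"))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(
                findYarnConfDir(System.getenv())
                        .map(Path::toString)
                        .orElse("no yarn-site.xml found via YARN_CONF_DIR or HADOOP_CONF_DIR"));
    }
}
```

A fetcher could log the resolved directory, or fail with a clear message when none is found, before it starts the client.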

1996fanrui · 2025-10-14T20:08:24Z

+                }
+                break;
+            }
+        } catch (Throwable ignore) {


This catch does not provide fault isolation between jobs or YARN applications: if one job is stuck on GC or something else, the autoscaler stops working for all applications.
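
One way to get that isolation, sketched under assumptions: run each application's fetch as its own task with a bounded wait, and skip only the application that fails or times out. `PerApplicationIsolationSketch`, `fetchAll`, and the string results are hypothetical stand-ins; the real fetcher would return job contexts and use the project's logger.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Sketch only: per-application try/catch with a timeout. */
class PerApplicationIsolationSketch {

    /** Fetches every application concurrently; one bad application does not block the rest. */
    static List<String> fetchAll(Map<String, Callable<String>> perAppFetch, long timeoutMillis) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            perAppFetch.forEach((appId, task) -> pending.put(appId, pool.submit(task)));

            List<String> results = new ArrayList<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    // Bounded wait per application, applied in submission order.
                    results.add(entry.getValue().get(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                } catch (Exception e) {
                    // Timeout or failure of this application only: cancel it and continue.
                    entry.getValue().cancel(true);
                    System.err.println("Skipping " + entry.getKey() + ": " + e);
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Map<String, Callable<String>> apps = new LinkedHashMap<>();
        apps.put("app_healthy", () -> "job-a");
        apps.put("app_stuck", () -> {
            Thread.sleep(5_000);
            return "never returned";
        });
        System.out.println(fetchAll(apps, 200));
    }
}
```

The unbounded cached thread pool keeps the sketch short; a real implementation would likely reuse a sized executor across fetch cycles.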

1996fanrui · 2025-10-14T20:11:04Z

+        <dependency>
+            <groupId>org.apache.flink</groupId>
+            <artifactId>flink-yarn</artifactId>
+            <version>${flink.version}</version>
+        </dependency>


Is it possible to minimize the scope of the dependencies? For example, could only yarn-client be added here?

Also, do some transitive dependencies need to be excluded to avoid dependency conflicts?

gyfora · 2026-04-14T14:49:47Z

@Samrat002 are you still working on this?

Samrat002 marked this pull request as ready for review September 23, 2025 05:17

1996fanrui self-assigned this Sep 23, 2025

[FLINK-33100] Implement YarnJobListFetcher

87eaa35

Samrat002 force-pushed the FLINK-33100 branch from 6ba949e to 87eaa35 Compare September 27, 2025 09:03

1996fanrui reviewed Oct 14, 2025

Samrat002 wants to merge 1 commit into apache:main from Samrat002:FLINK-33100


3 participants
