[SUPPORT] Remote filesystem view only works on the first micro-batch in Spark Structured Streaming #12652
To add some more context, we tried using
Maybe @the-other-tim-brown can take a look.
I am not familiar with the Spark streaming code, but these are the lines I find suspicious, since @psendyk is seeing the configs reset to the defaults for the remote FSView, which happens when the write client is closed:
It may be better to return
Describe the problem you faced
After enabling the timeline server and using the `REMOTE_ONLY` file system view, Spark Structured Streaming ingestion into Hudi fails on the second micro-batch with:
org.apache.hudi.exception.HoodieRemoteException: Connect to localhost:26754 [localhost/127.0.0.1] failed: Connection refused
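For reference, this is roughly the configuration in play, a sketch only; the two Hudi options shown are the relevant ones, but your job will set others as well:

```
hoodie.embed.timeline.server=true
hoodie.filesystem.view.type=REMOTE_ONLY
```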
To Reproduce
Steps to reproduce the behavior:
1. Run a Spark Structured Streaming ingestion into Hudi with the embedded timeline server enabled and the `REMOTE_ONLY` file system view; the second micro-batch fails in the "Getting small files from partitions" stage.

Expected behavior
The application should continue without failure
Environment Description
Hudi version : 0.15.0
Spark version : 3.3.0
Hive version :
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
This issue occurs because the timeline service is stopped after the first write and is never restarted. Specifically, the write client is closed, which in turn closes the timeline server in the `finally` block in `HoodieSparkSqlWriterInternal.writeInternal`. Here's a stack trace showing how this part of the code is reached:

The timeline server is only started when `HoodieBaseClient` is instantiated, via `startEmbeddedServerView`, but the write client is reused across batches within `HoodieStreamingSink.addBatch`. Therefore, the second batch uses a client that has already closed the timeline server. The error may at first look like a config issue, but that is only because the view storage config is updated with the timeline service config when the service is started, then reset to the client-provided config when the service is stopped. We can see the correct config, pointing to the timeline service, being used for the first write:

but not for the second write:
I was able to work around this failure mode by modifying `HoodieSparkSqlWriterInternal.writeInternal` to restart the timeline server at the beginning of each write, but as you can imagine this only gives us a partial improvement (within a batch but not across batches). I'm wondering if there's a reason why the write client, and subsequently the timeline server, is stopped after each batch?

Stacktrace
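The workaround I described can be sketched against the same kind of simplified model (again, illustrative names, not Hudi's API): restarting the server at the beginning of each write lets every batch succeed, though a proper fix would presumably keep the server alive across batches instead.

```java
// Simplified model of the workaround: restart the timeline server at the
// beginning of each write. Names are illustrative, not Hudi's real API.
public class WorkaroundDemo {

    static class TimelineServer {
        private boolean running = false;
        void start() { running = true; }
        void stop()  { running = false; }
        boolean isRunning() { return running; }
    }

    static class WriteClient {
        private final TimelineServer server = new TimelineServer();

        String write(String batch) {
            if (!server.isRunning()) {
                server.start(); // workaround: restart the server for each write
            }
            String result = batch + " ok";
            server.stop();      // mimics the close in the finally block
            return result;
        }
    }

    public static void main(String[] args) {
        WriteClient client = new WriteClient();
        System.out.println(client.write("batch-0")); // batch-0 ok
        System.out.println(client.write("batch-1")); // batch-1 ok: no longer fails
    }
}
```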