TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)
I cannot find any documentation on this. What is the behavior if the output files already exist (in a gs:// bucket)?
Thanks, G
The files will be overwritten. There are several motivations for this:
- The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case of producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
- It is nice if rerunning a pipeline is idempotent(-ish?). E.g. if you find a bug in your pipeline, you can just fix it, rerun it, and enjoy the overwritten correct results. A pipeline that appends to files would be very difficult to work with in this respect.
- It is not required to specify the number of output shards for TextIO.Write; it can differ between different executions, even for the same pipeline and the same input data. The semantics of appending in this case would be very confusing.
- Appending is, as far as I know, impossible to implement efficiently using any filesystem I'm aware of, while preserving the atomicity and fault tolerance guarantees (e.g. that you produce all of the output or none of it, even in the face of bundle re-executions due to failures).
This behavior will be documented in the next version of the SDK that appears on GitHub.
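To make the overwrite and all-or-nothing points above concrete, here is a minimal local-filesystem sketch using only the Java standard library. It is an analogy, not the SDK's actual code: the class name `OverwriteSketch`, the method `writeAtomically`, and the file name `results.txt` are all hypothetical. The idea is to write the complete output to a temporary file and then move it over the final name, so a rerun replaces the old results and a failed attempt leaves no partial output behind.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating "overwrite, all or nothing" on a local
// filesystem. This is NOT how TextIO.Write is implemented.
final class OverwriteSketch {

    // Writes all lines to a temp file next to the target, then moves the
    // temp file over the target. Readers see the old file or the new one.
    static void writeAtomically(Path target, List<String> lines) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "tmp-", ".part");
        try {
            Files.write(tmp, lines, StandardCharsets.UTF_8);
            // Assumption: the platform replaces an existing target on an
            // atomic move (true for POSIX rename); the Javadoc leaves this
            // implementation specific.
            Files.move(tmp, target,
                    StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Removes the temp file if the write or the move failed;
            // a no-op after a successful move.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("overwrite-sketch");
        Path out = dir.resolve("results.txt");

        // First run produces three lines (say, from a buggy pipeline).
        writeAtomically(out, Arrays.asList("old-1", "old-2", "old-3"));
        // The rerun overwrites them; nothing from the first run survives.
        writeAtomically(out, Arrays.asList("new-1"));

        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

With append semantics, the second call would have to reason about what the first run left behind, which is exactly the difficulty the bullets above describe; with overwrite, a rerun only needs to produce its own complete output.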