TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)?


I cannot find documentation on this. What is the behavior if the output files already exist (in a gs:// bucket)?

Thanks, G

The files will be overwritten. There are several motivations for this:

  • The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case of producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
  • It is good if rerunning a pipeline is idempotent(-ish?). E.g. if you find a bug in your pipeline, you can fix it, rerun it, and enjoy the overwritten, correct results. A pipeline that appends to files would be very difficult to work with in this respect.
  • It is not required to specify the number of output shards for TextIO.Write; it can differ between different executions, even of the same pipeline on the same input data. The semantics of appending in this case would be very confusing.
  • Appending is, as far as I know, impossible to implement efficiently on any filesystem I'm aware of while preserving the atomicity and fault tolerance guarantees (e.g. that you produce all of the output or none of it, even in the face of bundle re-executions due to failures).
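
To make the idempotency point concrete, here is a minimal sketch on a local filesystem using only the Java standard library. It is an analogy under stated assumptions, not the Dataflow SDK's implementation: the class `OverwriteSketch`, the helper `writeShard`, and the shard name are all hypothetical. Each shard is staged in a temporary file and then moved over the final name, replacing whatever an earlier run left there.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

final class OverwriteSketch {

    // Hypothetical helper (not part of any SDK): stage the lines in a
    // temporary file, then move it over the final shard name, replacing
    // any file left there by an earlier run.
    static void writeShard(Path dir, String shardName, List<String> lines) throws IOException {
        Files.createDirectories(dir);
        Path tmp = Files.createTempFile(dir, shardName, ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        // REPLACE_EXISTING gives overwrite semantics; this sketch does not
        // claim strict atomicity, which depends on the filesystem.
        Files.move(tmp, dir.resolve(shardName), StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("overwrite-sketch");
        List<String> results = List.of("a,1", "b,2");

        writeShard(out, "part-00000-of-00001", results);
        // Simulated rerun of the same "pipeline": the shard is replaced,
        // so it still holds two lines rather than four.
        writeShard(out, "part-00000-of-00001", results);

        System.out.println(Files.readAllLines(out.resolve("part-00000-of-00001")));
    }
}
```

With append semantics, the second call would have doubled the file's contents, and a rerun with a different shard count would leave a mix of old and new data.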

This behavior will be documented in the next version of the SDK that appears on GitHub.

