mercredi 18 mars 2020

How do I create a stream of fake data entering an Apache Beam pipeline at intervals of time?

I'm trying to create small Apache Beam streaming programs to test out ideas and I think the simplest thing for me to do to get data in would be to use framework constructs like Create.of to create fake data. That way, I wouldn't have to set up more than I need to, like setting up a GCP Pub/Sub topic as a source and publishing to it.

The problem is I want to try out things that are time based, like windowing and working with state and timers. I was able to put this together:

public class TestPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of(1, 2, 3))
            .apply(ParDo.of(new DoFn<Integer, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(c.element().toString());
                }
            }))
            .apply(TextIO.write().to("myfile.txt"));

        p.run().waitUntilFinish();
    }
}

This accomplishes my goal of sending off three pieces of data at the beginning of my pipeline, but it sends them all at once. I'd prefer if I could set it to send each piece of data every 10 seconds, etc.

I followed this tutorial from Apache Flink (https://ci.apache.org/projects/flink/flink-docs-release-1.10/getting-started/walkthroughs/datastream_api.html) which shows an example of what I'm trying to accomplish. I dug into the code in that tutorial but I wasn't able to find out exactly which part of the Flink framework made that happen.

Aucun commentaire:

Enregistrer un commentaire