This article walks through the basics of using Flink with hands-on batch and streaming examples, in both Java and Scala.
First, generate the Flink quickstart sample project from the Maven archetype:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.7.2 \
  -DarchetypeCatalog=local
Batch processing from a file

Create a hello.txt file with the following content:

hello world welcome
hello welcome

Count the word frequencies:

public class BatchJavaApp {
    public static void main(String[] args) throws Exception {
        String input = "/Users/admin/Downloads/flink/data/hello.txt";
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> text = env.readTextFile(input);
        text.print();
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token, 1));
                }
            }
        }).groupBy(0).sum(1).print();
    }
}
Output (logs omitted):

hello world welcome
hello welcome
(world,1)
(hello,2)
(welcome,2)
A plain-Java implementation

File-reading utility class:

public class FileOperation {

    /**
     * Read the contents of the file named filename and put every word it contains into words.
     * @param filename
     * @param words
     * @return
     */
    public static boolean readFile(String filename, List<String> words) {
        if (filename == null || words == null) {
            System.out.println("filename or words is null");
            return false;
        }
        Scanner scanner;
        try {
            File file = new File(filename);
            if (file.exists()) {
                FileInputStream fis = new FileInputStream(file);
                scanner = new Scanner(new BufferedInputStream(fis), "UTF-8");
                scanner.useLocale(Locale.ENGLISH);
            } else {
                return false;
            }
        } catch (FileNotFoundException e) {
            System.out.println("Cannot open " + filename);
            return false;
        }
        // simple tokenization
        if (scanner.hasNextLine()) {
            String contents = scanner.useDelimiter("\\A").next();
            int start = firstCharacterIndex(contents, 0);
            for (int i = start + 1; i <= contents.length(); ) {
                if (i == contents.length() || !Character.isLetter(contents.charAt(i))) {
                    String word = contents.substring(start, i).toLowerCase();
                    words.add(word);
                    start = firstCharacterIndex(contents, i);
                    i = start + 1;
                } else {
                    i++;
                }
            }
        }
        return true;
    }

    private static int firstCharacterIndex(String s, int start) {
        for (int i = start; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                return i;
            }
        }
        return s.length();
    }
}

public class BatchJavaOnly {
    public static void main(String[] args) {
        String input = "/Users/admin/Downloads/flink/data/hello.txt";
        List<String> list = new ArrayList<>();
        FileOperation.readFile(input, list);
        System.out.println(list);
        Map<String, Integer> map = new HashMap<>();
        list.forEach(token -> {
            if (map.containsKey(token)) {
                map.put(token, map.get(token) + 1);
            } else {
                map.put(token, 1);
            }
        });
        map.entrySet().stream()
                .map(entry -> new Tuple2<>(entry.getKey(), entry.getValue()))
                .forEach(System.out::println);
    }
}
Output:

[hello, world, welcome, hello, welcome]
(world,1)
(hello,2)
(welcome,2)
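A small side note on the counting loop above: the containsKey/put branch can be collapsed into a single call with Map.merge. A minimal, self-contained sketch (the sample tokens are hard-coded here to match what hello.txt produces):

```java
import java.util.*;

public class MergeDemo {
    // Word counting with Map.merge: one line replaces the containsKey/put branch.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> map = new HashMap<>();
        words.forEach(token -> map.merge(token, 1, Integer::sum));
        return map;
    }

    public static void main(String[] args) {
        // same tokens the hello.txt example produces
        System.out.println(count(List.of("hello", "world", "welcome", "hello", "welcome")));
    }
}
```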
Scala version

Generate the Scala quickstart project:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-scala \
  -DarchetypeVersion=1.7.2 \
  -DarchetypeCatalog=local

After installing the Scala plugin and configuring the Scala SDK:

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object BatchScalaApp {
  def main(args: Array[String]): Unit = {
    val input = "/Users/admin/Downloads/flink/data/hello.txt"
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(input)
    text.flatMap(_.toLowerCase.split(" "))
      .filter(_.nonEmpty)
      .map((_, 1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}
Output (logs omitted):

(world,1)
(hello,2)
(welcome,2)

For Scala fundamentals, see the earlier posts on getting started with Scala and Scala object-oriented programming.
Stream processing from a network socket

public class StreamingJavaApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token, 1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5))
          .sum(1).print();
        env.execute("StreamingJavaApp");
    }
}
Before running, open a listening port with netcat:

nc -lk 9999

Run the program, then type a a c d b c e e f a into the nc session.
Output (logs omitted; the N> prefix is the index of the parallel subtask that emitted the record):

1> (e,2)
9> (d,1)
11> (a,3)
3> (b,1)
4> (f,1)
8> (c,2)
Scala version

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object StreamScalaApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9999)
    text.flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .print()
      .setParallelism(1)
    env.execute("StreamScalaApp")
  }
}
Output (logs omitted):

(c,2)
(b,1)
(d,1)
(f,1)
(e,2)
(a,3)
Now let's replace the tuple with an entity class:

public class StreamObjJavaApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token, 1));
                }
            }
        }).keyBy("word").timeWindow(Time.seconds(5))
          .sum("count").print();
        env.execute("StreamingJavaApp");
    }
}
Output:

4> StreamObjJavaApp.WordCount(word=f, count=1)
11> StreamObjJavaApp.WordCount(word=a, count=3)
8> StreamObjJavaApp.WordCount(word=c, count=2)
1> StreamObjJavaApp.WordCount(word=e, count=2)
9> StreamObjJavaApp.WordCount(word=d, count=1)
3> StreamObjJavaApp.WordCount(word=b, count=1)
Of course, we can also select the key with a method reference:

public class StreamObjJavaApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token, 1));
                }
            }
        }).keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
          .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}
The argument to keyBy is the functional interface KeySelector:

@Public
@FunctionalInterface
public interface KeySelector<IN, KEY> extends Function, Serializable {
    KEY getKey(IN value) throws Exception;
}
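To see how a method reference such as WordCount::getWord satisfies this interface, here is a minimal, Flink-free sketch; the interface is re-declared locally in simplified form (without Function and Serializable) purely for illustration:

```java
public class KeySelectorDemo {
    // Simplified local stand-in for Flink's KeySelector<IN, KEY>; the real one
    // also extends Function and Serializable.
    interface KeySelector<IN, KEY> {
        KEY getKey(IN value) throws Exception;
    }

    static class WordCount {
        private final String word;
        private final int count;
        WordCount(String word, int count) { this.word = word; this.count = count; }
        String getWord() { return word; }
    }

    // Binds the method reference WordCount::getWord to the interface and applies it.
    static String demo() {
        KeySelector<WordCount, String> byWord = WordCount::getWord;
        try {
            return byWord.getKey(new WordCount("hello", 1));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "hello"
    }
}
```

Any method of shape KEY getKey(IN) can be plugged in this way, which is why keyBy(WordCount::getWord) type-checks.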
The lambda form of flatMap is rather cumbersome: because of type erasure it needs an explicit cast and a .returns() type hint, so it is arguably no better than the anonymous class:

public class StreamObjJavaApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap((FlatMapFunction<String, WordCount>) (value, collector) -> {
            String[] tokens = value.toLowerCase().split(" ");
            for (String token : tokens) {
                collector.collect(new WordCount(token, 1));
            }
        }).returns(WordCount.class)
          .keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
          .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}
flatMap can also take the RichFlatMapFunction abstract class, which additionally provides open/close lifecycle methods and access to the runtime context:

public class StreamObjJavaApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new RichFlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token, 1));
                }
            }
        }).keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
          .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}
Scala version

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object StreamObjScalaApp {

  case class WordCount(word: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9999)
    text.flatMap(_.split(" "))
      .map(WordCount(_, 1))
      .keyBy("word")
      .timeWindow(Time.seconds(5))
      .sum("count")
      .print()
      .setParallelism(1)
    env.execute("StreamScalaApp")
  }
}
Output (logs omitted):

WordCount(b,1)
WordCount(d,1)
WordCount(e,2)
WordCount(f,1)
WordCount(a,3)
WordCount(c,2)
Data sources

Getting data from a collection

public class DataSetDataSourceApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        fromCollection(env);
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output (logs omitted):

1
2
3
4
5
6
7
8
9
10
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    fromCollection(env)
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output (logs omitted):

1
2
3
4
5
6
7
8
9
10
Getting data from a file

public class DataSetDataSourceApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
        textFile(env);
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output (logs omitted):

hello welcome
hello world welcome
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
    textFile(env)
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output (logs omitted):

hello welcome
hello world welcome
Getting data from a CSV file

Create a people.csv under the data directory with the following content:

name,age,job
Jorge,30,Developer
Bob,32,Developer

public class DataSetDataSourceApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
        csvFile(env);
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
        env.readCsvFile(filePath).ignoreFirstLine()
                .types(String.class, Integer.class, String.class)
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output (logs omitted):

(Bob,32,Developer)
(Jorge,30,Developer)
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
    csvFile(env)
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
    env.readCsvFile[(String, Int, String)](filePath, ignoreFirstLine = true).print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output (logs omitted):

(Jorge,30,Developer)
(Bob,32,Developer)
Mapping the rows into an entity class:

public class DataSetDataSourceApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
        csvFile(env);
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class, Integer.class, String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class, "name", "age", "job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output (logs omitted):

DataSetDataSourceApp.Case(name=Bob, age=32, job=Developer)
DataSetDataSourceApp.Case(name=Jorge, age=30, job=Developer)
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {

  case class Case(name: String, age: Int, job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
    csvFile(env)
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String, Int, String)](filePath, ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath, ignoreFirstLine = true, includedFields = Array(0, 1, 2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output (logs omitted):

Case(Bob,32,Developer)
Case(Jorge,30,Developer)
Reading folders recursively

Create two folders, 1 and 2, under the data directory and copy hello.txt into each.

public class DataSetDataSourceApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
//        csvFile(env);
        readRecursiveFile(env);
    }

    public static void readRecursiveFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data";
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);
        env.readTextFile(filePath).withParameters(parameters)
                .print();
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class, Integer.class, String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class, "name", "age", "job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output (logs omitted):

hello world welcome
hello world welcome
hello welcome
Jorge,30,Developer
name,age,job
hello world welcome
hello welcome
hello welcome
Bob,32,Developer
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object DataSetDataSourceApp {

  case class Case(name: String, age: Int, job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
//    csvFile(env)
    readRecursiveFiles(env)
  }

  def readRecursiveFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data"
    val parameters = new Configuration
    parameters.setBoolean("recursive.file.enumeration", true)
    env.readTextFile(filePath).withParameters(parameters).print()
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String, Int, String)](filePath, ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath, ignoreFirstLine = true, includedFields = Array(0, 1, 2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output (logs omitted):

hello world welcome
hello world welcome
hello welcome
Jorge,30,Developer
name,age,job
hello world welcome
hello welcome
hello welcome
Bob,32,Developer
Reading compressed files

Create a folder 3 under the data directory, then compress hello.txt:

gzip hello.txt

This produces a new file hello.txt.gz; move that file into folder 3.

public class DataSetDataSourceApp {

    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
//        csvFile(env);
//        readRecursiveFile(env);
        readCompressionFile(env);
    }

    public static void readCompressionFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/3";
        env.readTextFile(filePath).print();
    }

    public static void readRecursiveFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data";
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);
        env.readTextFile(filePath).withParameters(parameters)
                .print();
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class, Integer.class, String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class, "name", "age", "job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
Output:

hello world welcome
hello welcome

The compression formats Flink supports are: .deflate, .gz, .gzip, .bz2, and .xz.
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object DataSetDataSourceApp {

  case class Case(name: String, age: Int, job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
//    csvFile(env)
//    readRecursiveFiles(env)
    readCompressionFiles(env)
  }

  def readCompressionFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/3"
    env.readTextFile(filePath).print()
  }

  def readRecursiveFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data"
    val parameters = new Configuration
    parameters.setBoolean("recursive.file.enumeration", true)
    env.readTextFile(filePath).withParameters(parameters).print()
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String, Int, String)](filePath, ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath, ignoreFirstLine = true, includedFields = Array(0, 1, 2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}
Output:

hello world welcome
hello welcome
Operators

The map operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        mapFunction(env);
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).print();
    }
}
Output (logs omitted):

2
3
4
5
6
7
8
9
10
11
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    mapFunction(env)
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).print()
  }
}
Output (logs omitted):

2
3
4
5
6
7
8
9
10
11
The filter operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
        filterFunction(env);
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).print();
    }
}
Output (logs omitted):

6
7
8
9
10
11
Scala version

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
    filterFunction(env)
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).print()
  }
}
Output (logs omitted):

6
7
8
9
10
11
The mapPartition operator

mapPartition is invoked once per partition, so the number of calls follows the parallelism rather than the number of records.

A utility class that simulates acquiring a database connection:

public class DBUntils {
    public static int getConnection() {
        return new Random().nextInt(10);
    }

    public static void returnConnection(int connection) {
    }
}

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
        mapPartitionFunction(env);
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // this conversion runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // this conversion runs once per partition, i.e. per unit of parallelism
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).print();
    }
}
Output (logs omitted; the dots stand for many more numbers, 100 values in total above the dashed line):

.
.
.
.
5
4
0
3
-----------
5
5
0
3
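The difference behind the two printouts can be sketched without Flink: with 100 records and a parallelism of 4, a per-record function would acquire 100 connections while a per-partition function acquires only 4. The record and partition counts are hard-coded here to mirror the example above:

```java
public class MapVsMapPartition {
    // Counts how many "connections" each style would open (illustration only,
    // no Flink involved): returns {perRecordCount, perPartitionCount}.
    static int[] simulate(int records, int parallelism) {
        int perRecord = 0;
        for (int i = 0; i < records; i++) {
            perRecord++; // map style: one connection per record
        }
        int perPartition = 0;
        for (int p = 0; p < parallelism; p++) {
            perPartition++; // mapPartition style: one connection per partition
        }
        return new int[]{perRecord, perPartition};
    }

    public static void main(String[] args) {
        int[] r = simulate(100, 4);
        System.out.println(r[0] + " connections vs " + r[1] + " connections");
    }
}
```

This is why expensive per-element setup such as opening a database connection belongs in mapPartition (or in a RichFunction's open method), not in map.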
Scala version

import scala.util.Random

object DBUntils {
  def getConnection(): Int = {
    new Random().nextInt(10)
  }

  def returnConnection(connection: Int): Unit = {
  }
}

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
    mapPartitionFunction(env)
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).print()
  }
}
Output:

.
.
.
.
5
4
0
3
-----------
5
5
0
3
The first operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"),
                new Tuple2<>(1, "Flink"),
                new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"),
                new Tuple2<>(3, "Linux"),
                new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.first(3).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // this conversion runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // this conversion runs once per partition, i.e. per unit of parallelism
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).print();
    }
}
Output (logs omitted):

(1,Hadoop)
(1,Spark)
(1,Flink)
Scala version

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1, "Hadoop"), (1, "Spark"), (1, "Flink"), (2, "Java"),
      (2, "Spring boot"), (3, "Linux"), (4, "VUE"))
    val data = env.fromCollection(info)
    data.first(3).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    data.map(_ + 1).print()
  }
}
Output (logs omitted):

(1,Hadoop)
(1,Spark)
(1,Flink)
Taking the first two records of each group:

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"),
                new Tuple2<>(1, "Flink"),
                new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"),
                new Tuple2<>(3, "Linux"),
                new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // this conversion runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // this conversion runs once per partition, i.e. per unit of parallelism
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
(3,Linux)
(1,Hadoop)
(1,Spark)
(4,VUE)
(2,Java)
(2,Spring boot)
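To see what `groupBy(0).first(2)` computes without a Flink cluster, here is a minimal plain-Java sketch (no Flink dependency; the class and method names such as `firstPerGroup` are illustrative, not Flink API). It buckets tuples by their first field and keeps at most n per bucket in encounter order; note that, unlike this sketch, the order of groups in Flink's printed output is not guaranteed, as the shuffled result above shows.

```java
import java.util.*;
import java.util.stream.*;

public class GroupFirstDemo {
    // Keep at most n elements per key (the first tuple field), in encounter order.
    static List<Map.Entry<Integer, String>> firstPerGroup(List<Map.Entry<Integer, String>> rows, int n) {
        Map<Integer, List<Map.Entry<Integer, String>>> groups = rows.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, LinkedHashMap::new, Collectors.toList()));
        return groups.values().stream()
                .flatMap(g -> g.stream().limit(n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> info = List.of(
                Map.entry(1, "Hadoop"), Map.entry(1, "Spark"), Map.entry(1, "Flink"),
                Map.entry(2, "Java"), Map.entry(2, "Spring boot"),
                Map.entry(3, "Linux"), Map.entry(4, "VUE"));
        firstPerGroup(info, 2).forEach(e -> System.out.println("(" + e.getKey() + "," + e.getValue() + ")"));
    }
}
```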
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
(3,Linux)
(1,Hadoop)
(1,Spark)
(4,VUE)
(2,Java)
(2,Spring boot)
Take the first two in each group in ascending order
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.ASCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
(3,Linux)
(1,Flink)
(1,Hadoop)
(4,VUE)
(2,Java)
(2,Spring boot)
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.ASCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
(3,Linux)
(1,Flink)
(1,Hadoop)
(4,VUE)
(2,Java)
(2,Spring boot)
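The semantics of `groupBy(0).sortGroup(1, Order.ASCENDING).first(2)` can be sketched in plain Java as well (again no Flink; `sortedFirstPerGroup` is an illustrative name): sort each group by the second field, then take the first n of each group. For key 1 the ascending order is Flink, Hadoop, Spark, so the first two are Flink and Hadoop, matching the output above.

```java
import java.util.*;
import java.util.stream.*;

public class SortGroupDemo {
    // Sort each group (keyed by the first field) by the second field ascending,
    // then keep the first n per group.
    static List<Map.Entry<Integer, String>> sortedFirstPerGroup(List<Map.Entry<Integer, String>> rows, int n) {
        return rows.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new, Collectors.toList()))
                .values().stream()
                .flatMap(g -> g.stream()
                        .sorted(Map.Entry.comparingByValue())
                        .limit(n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> info = List.of(
                Map.entry(1, "Hadoop"), Map.entry(1, "Spark"), Map.entry(1, "Flink"),
                Map.entry(2, "Java"), Map.entry(2, "Spring boot"),
                Map.entry(3, "Linux"), Map.entry(4, "VUE"));
        sortedFirstPerGroup(info, 2).forEach(e -> System.out.println("(" + e.getKey() + "," + e.getValue() + ")"));
    }
}
```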
Take the first two in each group in descending order
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
(3,Linux)
(1,Spark)
(1,Hadoop)
(4,VUE)
(2,Spring boot)
(2,Java)
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
(3,Linux)
(1,Spark)
(1,Hadoop)
(4,VUE)
(2,Spring boot)
(2,Java)
The flatMap operator
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
        flatMapFunction(env);
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
hadoop
spark
hadoop
flink
flink
flink
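The flatMap step above (split each line, emit every token through the Collector) has a direct plain-Java equivalent using `Stream.flatMap`, shown here as a dependency-free sketch (`tokens` is an illustrative name):

```java
import java.util.*;
import java.util.stream.*;

public class FlatMapDemo {
    // Split each comma-separated line into tokens and flatten them into one list,
    // mirroring what the Flink flatMap emits through its Collector.
    static List<String> tokens(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        tokens(List.of("hadoop,spark", "hadoop,flink", "flink,flink"))
                .forEach(System.out::println);
    }
}
```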
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
    flatMapFunction(env)
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
hadoop
spark
hadoop
flink
flink
flink
Of course, Scala also supports the same Collector-based style as the Java version:
def flatMapFunction(env: ExecutionEnvironment): Unit = {
  val info = List("hadoop,spark","hadoop,flink","flink,flink")
  val data = env.fromCollection(info)
  data.flatMap((value, collector: Collector[String]) => {
    val tokens = value.split(",")
    tokens.foreach(collector.collect(_))
  }).print()
}
Counting word occurrences
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
        flatMapFunction(env);
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value, 1);
                    }
                })
                .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
(hadoop,2)
(flink,3)
(spark,1)
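The pipeline above (split, emit (word, 1), then sum per word via groupBy(0).sum(1)) can be expressed without Flink using Java streams, shown here as a minimal sketch (`wordCount` is an illustrative name):

```java
import java.util.*;
import java.util.stream.*;

public class WordCountDemo {
    // Split each line on commas, then count occurrences per word --
    // the same result as flatMap -> map(word, 1) -> groupBy(0).sum(1).
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        wordCount(List.of("hadoop,spark", "hadoop,flink", "flink,flink"))
                .forEach((w, c) -> System.out.println("(" + w + "," + c + ")"));
    }
}
```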
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
    flatMapFunction(env)
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
(hadoop,2)
(flink,3)
(spark,1)
The distinct operator
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
        distinctFunction(env);
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class).distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value, 1);
                    }
                })
                .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
hadoop
flink
spark
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
    distinctFunction(env)
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
hadoop
flink
spark
The join operator
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
        joinFunction(env);
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info1 = Arrays.asList(new Tuple2<>(1, "PK哥"),
                new Tuple2<>(2, "J哥"), new Tuple2<>(3, "小隊(duì)長"), new Tuple2<>(4, "天空藍(lán)"));
        List<Tuple2<Integer, String>> info2 = Arrays.asList(new Tuple2<>(1, "北京"),
                new Tuple2<>(2, "上海"), new Tuple2<>(3, "成都"), new Tuple2<>(4, "杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0), first.getField(1), second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class).distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value, 1);
                    }
                })
                .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted):
(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),杭州)
(2,J哥,上海)
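Conceptually, `join(data2).where(0).equalTo(0)` is a hash join on the first tuple field. A minimal plain-Java sketch of the same computation (no Flink; `innerJoin` and `JoinSketch` are illustrative names) builds a lookup table from the right side and probes it with each left row:

```java
import java.util.*;
import java.util.stream.*;

public class JoinSketch {
    // Hash-join two (id, value) relations on id; only ids present on both sides survive.
    static List<String> innerJoin(List<Map.Entry<Integer, String>> left,
                                  List<Map.Entry<Integer, String>> right) {
        Map<Integer, String> rightById = right.stream()
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        return left.stream()
                .filter(e -> rightById.containsKey(e.getKey()))
                .map(e -> "(" + e.getKey() + "," + e.getValue() + "," + rightById.get(e.getKey()) + ")")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> names = List.of(
                Map.entry(1, "PK哥"), Map.entry(2, "J哥"), Map.entry(3, "小隊(duì)長"), Map.entry(4, "天空藍(lán)"));
        List<Map.Entry<Integer, String>> cities = List.of(
                Map.entry(1, "北京"), Map.entry(2, "上海"), Map.entry(3, "成都"), Map.entry(4, "杭州"));
        innerJoin(names, cities).forEach(System.out::println);
    }
}
```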
Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊(duì)長"),(4,"天空藍(lán)"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(4,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.join(data2).where(0).equalTo(0).apply((first, second) =>
      (first._1, first._2, second._2)
    ).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for (i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    // Runs once per element of students
    data.map(student => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    // Runs once per parallel partition
    data.mapPartition((student, collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      // TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}
Run result (logs omitted):
(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),杭州)
(2,J哥,上海)
The outer join operator
Left outer join
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info1 = Arrays.asList(new Tuple2<>(1, "PK哥"),
                new Tuple2<>(2, "J哥"), new Tuple2<>(3, "小隊(duì)長"), new Tuple2<>(4, "天空藍(lán)"));
        List<Tuple2<Integer, String>> info2 = Arrays.asList(new Tuple2<>(1, "北京"),
                new Tuple2<>(2, "上海"), new Tuple2<>(3, "成都"), new Tuple2<>(5, "杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.leftOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, String> second) throws Exception {
                        if (second == null) {
                            return new Tuple3<>(first.getField(0), first.getField(1), "-");
                        }
                        return new Tuple3<>(first.getField(0), first.getField(1), second.getField(1));
                    }
                }).print();
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info1 = Arrays.asList(new Tuple2<>(1, "PK哥"),
                new Tuple2<>(2, "J哥"), new Tuple2<>(3, "小隊(duì)長"), new Tuple2<>(4, "天空藍(lán)"));
        List<Tuple2<Integer, String>> info2 = Arrays.asList(new Tuple2<>(1, "北京"),
                new Tuple2<>(2, "上海"), new Tuple2<>(3, "成都"), new Tuple2<>(4, "杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0), first.getField(1), second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class).distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String[] tokens = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value, 1);
                    }
                })
                .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info = Arrays.asList(new Tuple2<>(1, "Hadoop"),
                new Tuple2<>(1, "Spark"), new Tuple2<>(1, "Flink"), new Tuple2<>(2, "Java"),
                new Tuple2<>(2, "Spring boot"), new Tuple2<>(3, "Linux"), new Tuple2<>(4, "VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        // Transforms once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        // Transforms once per parallel partition
        data.mapPartition((MapPartitionFunction<String, Integer>) (student, collector) -> {
            int connection = DBUntils.getConnection();
            // TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),-)
(2,J哥,上海)

Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1, "PK哥"), (2, "J哥"), (3, "小隊(duì)長"), (4, "天空藍(lán)"))
    val info2 = List((1, "北京"), (2, "上海"), (3, "成都"), (5, "杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first, second) => {
      // second is null when there is no match on the right side
      if (second == null) {
        (first._1, first._2, "-")
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
  }

  // distinctFunction, flatMapFunction, firstFunction, mapPartitionFunction,
  // filterFunction and mapFunction are unchanged and omitted here.
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),-)
(2,J哥,上海)

Right outer join
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info1 = Arrays.asList(new Tuple2<>(1, "PK哥"),
                new Tuple2<>(2, "J哥"),
                new Tuple2<>(3, "小隊(duì)長"),
                new Tuple2<>(4, "天空藍(lán)"));
        List<Tuple2<Integer, String>> info2 = Arrays.asList(new Tuple2<>(1, "北京"),
                new Tuple2<>(2, "上海"),
                new Tuple2<>(3, "成都"),
                new Tuple2<>(5, "杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.rightOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, String> second) throws Exception {
                        // In a right outer join, first is null when there is no match on the left side
                        if (first == null) {
                            return new Tuple3<>(second.getField(0), "-", second.getField(1));
                        }
                        return new Tuple3<>(first.getField(0), first.getField(1), second.getField(1));
                    }
                }).print();
    }

    // joinFunction, distinctFunction, flatMapFunction, firstFunction,
    // mapPartitionFunction, filterFunction and mapFunction are unchanged
    // from the previous sections and are omitted here.
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(5,-,杭州)
(2,J哥,上海)

Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1, "PK哥"), (2, "J哥"), (3, "小隊(duì)長"), (4, "天空藍(lán)"))
    val info2 = List((1, "北京"), (2, "上海"), (3, "成都"), (5, "杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.rightOuterJoin(data2).where(0).equalTo(0).apply((first, second) => {
      // first is null when there is no match on the left side
      if (first == null) {
        (second._1, "-", second._2)
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
  }

  // distinctFunction, flatMapFunction, firstFunction, mapPartitionFunction,
  // filterFunction and mapFunction are unchanged and omitted here.
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(5,-,杭州)
(2,J哥,上海)

Full outer join
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer, String>> info1 = Arrays.asList(new Tuple2<>(1, "PK哥"),
                new Tuple2<>(2, "J哥"),
                new Tuple2<>(3, "小隊(duì)長"),
                new Tuple2<>(4, "天空藍(lán)"));
        List<Tuple2<Integer, String>> info2 = Arrays.asList(new Tuple2<>(1, "北京"),
                new Tuple2<>(2, "上海"),
                new Tuple2<>(3, "成都"),
                new Tuple2<>(5, "杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.fullOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, String> second) throws Exception {
                        // In a full outer join, either side can be null when it has no match
                        if (first == null) {
                            return new Tuple3<>(second.getField(0), "-", second.getField(1));
                        } else if (second == null) {
                            return new Tuple3<>(first.getField(0), first.getField(1), "-");
                        }
                        return new Tuple3<>(first.getField(0), first.getField(1), second.getField(1));
                    }
                }).print();
    }

    // joinFunction, distinctFunction, flatMapFunction, firstFunction,
    // mapPartitionFunction, filterFunction and mapFunction are unchanged
    // from the previous sections and are omitted here.
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),-)
(5,-,杭州)
(2,J哥,上海)

Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1, "PK哥"), (2, "J哥"), (3, "小隊(duì)長"), (4, "天空藍(lán)"))
    val info2 = List((1, "北京"), (2, "上海"), (3, "成都"), (5, "杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.fullOuterJoin(data2).where(0).equalTo(0).apply((first, second) => {
      // either side can be null when it has no match
      if (first == null) {
        (second._1, "-", second._2)
      } else if (second == null) {
        (first._1, first._2, "-")
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
  }

  // distinctFunction, flatMapFunction, firstFunction, mapPartitionFunction,
  // filterFunction and mapFunction are unchanged and omitted here.
}
Run result (logs omitted)

(3,小隊(duì)長,成都)
(1,PK哥,北京)
(4,天空藍(lán),-)
(5,-,杭州)
(2,J哥,上海)
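Taken together, the three variants differ only in which side's unmatched records survive: leftOuterJoin keeps unmatched left records, rightOuterJoin keeps unmatched right records, and fullOuterJoin keeps both, with the missing side filled by a placeholder. A minimal plain-Java sketch of these semantics (illustrative only, not Flink API; the helper names are made up):

```java
import java.util.*;

public class OuterJoinSketch {

    // Left outer join: every left entry appears once; "-" fills a missing right match.
    static List<String> leftOuterJoin(Map<Integer, String> left, Map<Integer, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : left.entrySet()) {
            out.add("(" + e.getKey() + "," + e.getValue() + ","
                    + right.getOrDefault(e.getKey(), "-") + ")");
        }
        return out;
    }

    // Full outer join: every key from either side appears once; "-" fills the missing side.
    static List<String> fullOuterJoin(Map<Integer, String> left, Map<Integer, String> right) {
        Set<Integer> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> out = new ArrayList<>();
        for (Integer k : keys) {
            out.add("(" + k + "," + left.getOrDefault(k, "-") + ","
                    + right.getOrDefault(k, "-") + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> people = new LinkedHashMap<>();
        people.put(1, "PK哥");
        people.put(4, "天空藍(lán)");
        Map<Integer, String> cities = new LinkedHashMap<>();
        cities.put(1, "北京");
        cities.put(5, "杭州");
        System.out.println(leftOuterJoin(people, cities));
        System.out.println(fullOuterJoin(people, cities));
    }
}
```

A right outer join is simply the left variant with the arguments swapped, which is why only the null check changes between the three Flink examples above.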
The cross operator

Cartesian product: every element of the first data set is paired with every element of the second.
public class DataSetTransformationApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
//        outJoinFunction(env);
        crossFunction(env);
    }

    public static void crossFunction(ExecutionEnvironment env) throws Exception {
        List<String> info1 = Arrays.asList("曼聯(lián)", "曼城");
        List<Integer> info2 = Arrays.asList(3, 1, 0);
        DataSource<String> data1 = env.fromCollection(info1);
        DataSource<Integer> data2 = env.fromCollection(info2);
        data1.cross(data2).print();
    }

    // outJoinFunction, joinFunction, distinctFunction, flatMapFunction, firstFunction,
    // mapPartitionFunction, filterFunction and mapFunction are unchanged
    // from the previous sections and are omitted here.
}
Run result (logs omitted)

(曼聯(lián),3)
(曼聯(lián),1)
(曼聯(lián),0)
(曼城,3)
(曼城,1)
(曼城,0)

Scala code
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
//    joinFunction(env)
    crossFunction(env)
  }

  def crossFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List("曼聯(lián)", "曼城")
    val info2 = List(3, 1, 0)
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.cross(data2).print()
  }

  // joinFunction, distinctFunction, flatMapFunction, firstFunction,
  // mapPartitionFunction, filterFunction and mapFunction are unchanged
  // and omitted here.
}
Run result (logs omitted)

(曼聯(lián),3)
(曼聯(lián),1)
(曼聯(lián),0)
(曼城,3)
(曼城,1)
(曼城,0)
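cross takes no join key: it pairs every element of the first data set with every element of the second, so the result size is |A| × |B| (here 2 × 3 = 6), which makes it expensive on large inputs. A plain-Java sketch of the same Cartesian-product semantics (illustrative only, not Flink API):

```java
import java.util.*;

public class CrossSketch {

    // Cartesian product: each element of a is paired with each element of b.
    static List<String> cross(List<String> a, List<Integer> b) {
        List<String> out = new ArrayList<>();
        for (String x : a) {
            for (int y : b) {
                out.add("(" + x + "," + y + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = cross(Arrays.asList("曼聯(lián)", "曼城"), Arrays.asList(3, 1, 0));
        result.forEach(System.out::println);  // 6 pairs, in nested-loop order
    }
}
```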
Sink (output)

Create a new sink-out folder under the flink directory; at this point the folder is empty.

Writing to a text file
public class DataSetSinkApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath);
        env.execute("DataSetSinkApp");
    }
}
Run result

Look inside the sink-out folder.

Scala code
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetSinkApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath)
    env.execute("DataSetSinkApp")
  }
}
Run result

If we run the code again at this point, it fails because the output file already exists. To overwrite the existing file, adjust the code:
public class DataSetSinkApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE);
        env.execute("DataSetSinkApp");
    }
}
Scala code
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object DataSetSinkApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath, WriteMode.OVERWRITE)
    env.execute("DataSetSinkApp")
  }
}
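The same fail-if-exists versus overwrite distinction exists in plain Java NIO, which may help build intuition for the two WriteMode values: CREATE_NEW behaves like the default NO_OVERWRITE and throws when the target already exists, while TRUNCATE_EXISTING corresponds to OVERWRITE. A small standalone sketch (the temp-file path is chosen by the JVM; helper name is made up):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class WriteModeSketch {

    // Writes a file, then tries CREATE_NEW (refused: file exists, like NO_OVERWRITE),
    // then truncates and rewrites it (like OVERWRITE). Returns the final content.
    static List<String> writeTwice(Path p) throws IOException {
        Files.write(p, List.of("first"));  // the file now exists
        try {
            // like Flink's default WriteMode.NO_OVERWRITE: refuse to replace an existing file
            Files.write(p, List.of("second"), StandardOpenOption.CREATE_NEW);
        } catch (FileAlreadyExistsException e) {
            System.out.println("second write refused: " + e.getClass().getSimpleName());
        }
        // like WriteMode.OVERWRITE: truncate the existing file and rewrite it
        Files.write(p, List.of("second"), StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
        return Files.readAllLines(p);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("sink", ".txt");
        System.out.println(writeTwice(p));
        Files.deleteIfExists(p);
    }
}
```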
Increase the parallelism to produce multiple output files
public class DataSetSinkApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE).setParallelism(4);
        env.execute("DataSetSinkApp");
    }
}
Run result

This time sinkjava becomes a directory, and it contains 4 files, one per parallel subtask.

Scala code
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object DataSetSinkApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath, WriteMode.OVERWRITE).setParallelism(4)
    env.execute("DataSetSinkApp")
  }
}
Run result

Counters
public class CounterApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");
        String filePath = "/Users/admin/Downloads/flink/sink-out/sink-java-counter/";
        data.map(new RichMapFunction<String, String>() {
            LongCounter counter = new LongCounter();

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                getRuntimeContext().addAccumulator("ele-counts-java", counter);
            }

            @Override
            public String map(String value) throws Exception {
                counter.add(1);
                return value;
            }
        }).writeAsText(filePath, FileSystem.WriteMode.OVERWRITE).setParallelism(4);
        JobExecutionResult jobResult = env.execute("CounterApp");
        Long num = jobResult.getAccumulatorResult("ele-counts-java");
        System.out.println("num: " + num);
    }
}
Run result (logs omitted)

num: 5

Scala code
import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.FileSystem.WriteMode

object CounterApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")
    val filePath = "/Users/admin/Downloads/flink/sink-out/sink-scala-counter/"
    data.map(new RichMapFunction[String, String]() {
      val counter = new LongCounter

      override def open(parameters: Configuration): Unit = {
        getRuntimeContext.addAccumulator("ele-counts-scala", counter)
      }

      override def map(value: String) = {
        counter.add(1)
        value
      }
    }).writeAsText(filePath, WriteMode.OVERWRITE).setParallelism(4)
    val jobResult = env.execute("CounterApp")
    val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")
    println("num: " + num)
  }
}
運(yùn)行結(jié)果(省略日志)
num: 5
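這里之所以要用累加器而不是map里的普通成員變量,是因?yàn)椴⑿卸葹?時,每個子任務(wù)各自維護(hù)一份局部計(jì)數(shù),作業(yè)結(jié)束后由Flink把各份局部值合并成最終結(jié)果。下面用純Java模擬這個"局部計(jì)數(shù)再合并"的過程(僅為示意,不是Flink的API):

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {
    // 每個子任務(wù)只統(tǒng)計(jì)自己分到的元素,最后把各局部計(jì)數(shù)相加
    public static long mergeCounts(List<List<String>> partitions) {
        long total = 0;
        for (List<String> partition : partitions) {
            long local = partition.size(); // 單個子任務(wù)的局部計(jì)數(shù)
            total += local;                // 作業(yè)結(jié)束時合并
        }
        return total;
    }

    public static void main(String[] args) {
        // 5個元素被分到4個并行子任務(wù)(對應(yīng)上文的 setParallelism(4))
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("hadoop", "spark"),
                Arrays.asList("flink"),
                Arrays.asList("pyspark"),
                Arrays.asList("storm"));
        System.out.println("num: " + mergeCounts(partitions));
    }
}
```

如果在map里用普通變量計(jì)數(shù),每個子任務(wù)拿到的只是自己那一份局部值,無法得到全局的5。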
分布式緩存
public class DistriutedCacheApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.registerCachedFile(filePath, "pk-java-dc");
        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");
        data.map(new RichMapFunction<String, String>() {
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                File dcFile = getRuntimeContext().getDistributedCache().getFile("pk-java-dc");
                List<String> lines = FileUtils.readLines(dcFile);
                lines.forEach(System.out::println);
            }

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        }).print();
    }
}
運(yùn)行結(jié)果(省略日志)
hello world welcome hello welcome hadoop spark flink pyspark storm
Scala代碼
import org.apache.commons.io.FileUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

object DistriutedCacheApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.registerCachedFile(filePath, "pk-scala-dc")
    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")
    data.map(new RichMapFunction[String, String] {
      override def open(parameters: Configuration): Unit = {
        val dcFile = getRuntimeContext.getDistributedCache.getFile("pk-scala-dc")
        val lines = FileUtils.readLines(dcFile)
        lines.asScala.foreach(println(_))
      }

      override def map(value: String) = {
        value
      }
    }).print()
  }
}
運(yùn)行結(jié)果(省略日志)
hello world welcome hello welcome hadoop spark flink pyspark storm
socket
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        socketFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}
運(yùn)行前執(zhí)行控制臺
nc -lk 9999
執(zhí)行后,在控制臺輸入
運(yùn)行結(jié)果(省略日志)
hello world welcome hello welcome hadoop spark flink pyspark storm
Scala代碼
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    socketFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1", 9999)
    data.print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志)
hello world welcome hello welcome hadoop spark flink pyspark storm
自定義數(shù)據(jù)源
不可并行數(shù)據(jù)源
public class CustomNonParallelSourceFunction implements SourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
        nonParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志,每隔1秒打印一次)
1 2 3 4 5 6 . . .
因?yàn)槭遣豢刹⑿校绻覀冋{(diào)大并行度則會報錯,如
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
        nonParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}
結(jié)果報錯
Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
	at com.guanjian.flink.java.test.DataStreamSourceApp.nonParallelSourceFunction(DataStreamSourceApp.java:17)
	at com.guanjian.flink.java.test.DataStreamSourceApp.main(DataStreamSourceApp.java:11)
Scala代碼
import org.apache.flink.streaming.api.functions.source.SourceFunction

class CustomNonParallelSourceFunction extends SourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
    nonParallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1", 9999)
    data.print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志,每隔1秒打印一次)
1 2 3 4 5 6 . . .
可并行數(shù)據(jù)源
public class CustomParallelSourceFunction implements ParallelSourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
//        nonParallelSourceFunction(env);
        parallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void parallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志,每隔1秒打印一次,每次打印兩條)
1 1 2 2 3 3 4 4 5 5 . . . .
Scala代碼
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

class CustomParallelSourceFunction extends ParallelSourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.{CustomNonParallelSourceFunction, CustomParallelSourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
//    nonParallelSourceFunction(env)
    parallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1", 9999)
    data.print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志,每隔1秒打印一次,每次打印兩條)
1 1 2 2 3 3 4 4 5 5 . . . .
增強(qiáng)數(shù)據(jù)源
public class CustomRichParallelSourceFunction extends RichParallelSourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    /**
     * 可以在這里面實(shí)現(xiàn)一些其他需求的代碼
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void close() throws Exception {
        super.close();
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
//        nonParallelSourceFunction(env);
//        parallelSourceFunction(env);
        richParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void richParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomRichParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void parallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}
運(yùn)行結(jié)果與可并行數(shù)據(jù)源相同
Scala代碼
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def open(parameters: Configuration): Unit = {
  }

  override def close(): Unit = {
  }

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.{CustomNonParallelSourceFunction, CustomParallelSourceFunction, CustomRichParallelSourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
//    nonParallelSourceFunction(env)
//    parallelSourceFunction(env)
    richParallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def richParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomRichParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1", 9999)
    data.print().setParallelism(1)
  }
}
運(yùn)行結(jié)果與可并行數(shù)據(jù)源相同
流算子
map和filter
public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        filterFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志)
接收到: 1 接收到: 2 2 接收到: 3 接收到: 4 4 接收到: 5 接收到: 6 6 . .
Scala代碼
import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    filterFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志)
接收到: 1 接收到: 2 2 接收到: 3 接收到: 4 4 接收到: 5 接收到: 6 6 . .
union
public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
        unionFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志)
1 1 2 2 3 3 4 4 5 5 . .
Scala代碼
import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    filterFunction(env)
    unionFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def unionFunction(env: StreamExecutionEnvironment): Unit = {
    val data1 = env.addSource(new CustomNonParallelSourceFunction)
    val data2 = env.addSource(new CustomNonParallelSourceFunction)
    data1.union(data2).print().setParallelism(1)
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志)
1 1 2 2 3 3 4 4 5 5 . .
split和select
將一個流拆成多個流以及挑選其中一個流
public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
//        unionFunction(env);
        splitSelectFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void splitSelectFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        SplitStream<Long> splits = data.split(value -> {
            List<String> list = new ArrayList<>();
            if (value % 2 == 0) {
                list.add("偶數(shù)");
            } else {
                list.add("奇數(shù)");
            }
            return list;
        });
        splits.select("奇數(shù)").print().setParallelism(1);
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志)
1 3 5 7 9 11 . .
Scala代碼
import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import java.util.ArrayList
import java.lang.Iterable

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    filterFunction(env)
//    unionFunction(env)
    splitSelectFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def splitSelectFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    val splits = data.split(new OutputSelector[Long] {
      override def select(value: Long): Iterable[String] = {
        val list = new ArrayList[String]
        if (value % 2 == 0) {
          list.add("偶數(shù)")
        } else {
          list.add("奇數(shù)")
        }
        list
      }
    })
    splits.select("奇數(shù)").print().setParallelism(1)
  }

  def unionFunction(env: StreamExecutionEnvironment): Unit = {
    val data1 = env.addSource(new CustomNonParallelSourceFunction)
    val data2 = env.addSource(new CustomNonParallelSourceFunction)
    data1.union(data2).print().setParallelism(1)
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}
運(yùn)行結(jié)果(省略日志)
1 3 5 7 9 11 . .
這里需要說明的是split已經(jīng)被設(shè)置為不推薦使用的方法
@deprecated
def split(selector: OutputSelector[T]): SplitStream[T] = asScalaStream(stream.split(selector))
因?yàn)镺utputSelector函數(shù)式接口的返回類型為一個Java專屬類型,對于Scala是不友好的,所以Scala這里也無法使用lambda表達(dá)式
當(dāng)然select也可以選取多個流
public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
//        unionFunction(env);
        splitSelectFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void splitSelectFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        SplitStream<Long> splits = data.split(value -> {
            List<String> list = new ArrayList<>();
            if (value % 2 == 0) {
                list.add("偶數(shù)");
            } else {
                list.add("奇數(shù)");
            }
            return list;
        });
        splits.select("奇數(shù)", "偶數(shù)").print().setParallelism(1);
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}
運(yùn)行結(jié)果(省略日志)
1 2 3 4 5 6 . .
Scala代碼修改是一樣的,這里就不重復(fù)了
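split/select 的語義可以這樣理解:split 按規(guī)則給每條數(shù)據(jù)打標(biāo)簽,select 再按標(biāo)簽取出對應(yīng)的流。下面用純 Java 對一段有限數(shù)據(jù)做個類比(僅為示意,不是 Flink 的 API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitSelectSketch {
    // partitioningBy 類比 split:按"奇數(shù)/偶數(shù)"給每個元素打標(biāo)簽
    public static Map<Boolean, List<Long>> split(List<Long> data) {
        return data.stream().collect(Collectors.partitioningBy(x -> x % 2 == 0));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Long>> splits = split(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L));
        // 按標(biāo)簽取流,類比 select("奇數(shù)") / select("偶數(shù)")
        System.out.println("奇數(shù): " + splits.get(false));
        System.out.println("偶數(shù): " + splits.get(true));
    }
}
```

區(qū)別在于 Flink 的 split 作用在無限流上,打標(biāo)簽和取流都是逐條進(jìn)行的,而這里是對有限集合一次性分組。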
流Sink
自定義Sink
將socket中的數(shù)據(jù)傳入mysql中
@Data
@ToString
@AllArgsConstructor
@NoArgsConstructor
public class Student {
    private int id;
    private String name;
    private int age;
}
public class SinkToMySQL extends RichSinkFunction<Student> {
    private Connection connection;
    private PreparedStatement pstmt;

    private Connection getConnection() {
        Connection conn = null;
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            String url = "jdbc:mysql://127.0.0.1:3306/flink";
            conn = DriverManager.getConnection(url, "root", "abcd123");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return conn;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        connection = getConnection();
        String sql = "insert into student(id,name,age) values (?,?,?)";
        pstmt = connection.prepareStatement(sql);
    }

    @Override
    public void invoke(Student value) throws Exception {
        System.out.println("invoke--------");
        pstmt.setInt(1, value.getId());
        pstmt.setString(2, value.getName());
        pstmt.setInt(3, value.getAge());
        pstmt.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (pstmt != null) {
            pstmt.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}
public class CustomSinkToMySQL {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source = env.socketTextStream("127.0.0.1", 9999);
        SingleOutputStreamOperator<Student> studentStream = source.map(value -> {
            String[] splits = value.split(",");
            Student stu = new Student(Integer.parseInt(splits[0]), splits[1], Integer.parseInt(splits[2]));
            return stu;
        }).returns(Student.class);
        studentStream.addSink(new SinkToMySQL());
        env.execute("CustomSinkToMySQL");
    }
}
代碼執(zhí)行前執(zhí)行
nc -lk 9999
執(zhí)行代碼后輸入
執(zhí)行結(jié)果
Scala代碼
class Student(var id: Int,var name: String,var age: Int) { }
import java.sql.{Connection, DriverManager, PreparedStatement}
import com.guanjian.flink.scala.test.Student
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class SinkToMySQL extends RichSinkFunction[Student] {
  var connection: Connection = null
  var pstmt: PreparedStatement = null

  def getConnection: Connection = {
    var conn: Connection = null
    Class.forName("com.mysql.cj.jdbc.Driver")
    val url = "jdbc:mysql://127.0.0.1:3306/flink"
    conn = DriverManager.getConnection(url, "root", "abcd123")
    conn
  }

  override def open(parameters: Configuration): Unit = {
    connection = getConnection
    val sql = "insert into student(id,name,age) values (?,?,?)"
    pstmt = connection.prepareStatement(sql)
  }

  override def invoke(value: Student): Unit = {
    println("invoke--------")
    pstmt.setInt(1, value.id)
    pstmt.setString(2, value.name)
    pstmt.setInt(3, value.age)
    pstmt.executeUpdate()
  }

  override def close(): Unit = {
    if (pstmt != null) {
      pstmt.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}
import com.guanjian.flink.scala.until.SinkToMySQL
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CustomSinkToMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val source = env.socketTextStream("127.0.0.1", 9999)
    val studentStream = source.map(value => {
      val splits = value.split(",")
      val stu = new Student(splits(0).toInt, splits(1), splits(2).toInt)
      stu
    })
    studentStream.addSink(new SinkToMySQL)
    env.execute("CustomSinkToMySQL")
  }
}
控制臺輸入
運(yùn)行結(jié)果
要使用flink的Table API,Java工程需要添加Scala依賴庫
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
在data目錄下添加一個sales.csv文件,文件內(nèi)容如下
transactionId,customerId,itemId,amountPaid
111,1,1,100.0
112,2,3,505.0
113,3,3,510.0
114,4,4,600.0
115,1,2,500.0
116,1,2,500.0
117,1,2,500.0
118,1,2,600.0
119,2,3,400.0
120,1,2,500.0
121,1,4,500.0
122,1,2,500.0
123,1,4,500.0
124,1,2,600.0
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class TableSQLAPI {
    @Data
    @ToString
    @AllArgsConstructor
    @NoArgsConstructor
    public static class SalesLog {
        private String transactionId;
        private String customerId;
        private String itemId;
        private Double amountPaid;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
        String filePath = "/Users/admin/Downloads/flink/data/sales.csv";
        DataSource<SalesLog> csv = env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(SalesLog.class, "transactionId", "customerId", "itemId", "amountPaid");
        Table sales = tableEnv.fromDataSet(csv);
        tableEnv.registerTable("sales", sales);
        Table resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales "
                + "group by customerId");
        DataSet<Row> result = tableEnv.toDataSet(resultTable, Row.class);
        result.print();
    }
}
運(yùn)行結(jié)果(省略日志)
3,510.0 4,600.0 1,4800.0 2,905.0
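上面SQL的分組求和結(jié)果,也可以用純Java流對同一份數(shù)據(jù)驗(yàn)證一遍(僅為示意):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SalesCheck {
    // 對 sales.csv 的每一行按 customerId 分組,對 amountPaid 求和
    public static Map<String, Double> sumByCustomer(List<String> lines) {
        return lines.stream()
                .skip(1) // 跳過表頭
                .map(l -> l.split(","))
                .collect(Collectors.groupingBy(f -> f[1],
                        Collectors.summingDouble(f -> Double.parseDouble(f[3]))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "transactionId,customerId,itemId,amountPaid",
                "111,1,1,100.0", "112,2,3,505.0", "113,3,3,510.0", "114,4,4,600.0",
                "115,1,2,500.0", "116,1,2,500.0", "117,1,2,500.0", "118,1,2,600.0",
                "119,2,3,400.0", "120,1,2,500.0", "121,1,4,500.0", "122,1,2,500.0",
                "123,1,4,500.0", "124,1,2,600.0");
        System.out.println(sumByCustomer(lines));
    }
}
```

可以算出 customerId 為 1 的合計(jì)是 4800.0,為 2 的是 905.0,與 Flink SQL 的輸出一致。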
Scala代碼
Scala項(xiàng)目同樣要放入依賴
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._

object TableSQLAPI {
  case class SalesLog(transactionId: String, customerId: String, itemId: String, amountPaid: Double)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tableEnv = TableEnvironment.getTableEnvironment(env)
    val filePath = "/Users/admin/Downloads/flink/data/sales.csv"
    val csv = env.readCsvFile[SalesLog](filePath, ignoreFirstLine = true, includedFields = Array(0, 1, 2, 3))
    val sales = tableEnv.fromDataSet(csv)
    tableEnv.registerTable("sales", sales)
    val resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales "
      + "group by customerId")
    tableEnv.toDataSet[Row](resultTable).print()
  }
}
運(yùn)行結(jié)果(省略日志)
3,510.0 4,600.0 1,4800.0 2,905.0
時間和窗口
Flink中有三個時間是比較重要的,包括事件時間(Event Time),處理時間(Processing Time),進(jìn)入Flink系統(tǒng)的時間(Ingestion Time)
通常我們都是使用事件時間來作為基準(zhǔn)。
設(shè)置時間的代碼
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
事件時間通常以時間戳的形式會包含在傳入的數(shù)據(jù)中的一個字段,通過提取,來決定窗口什么時候來執(zhí)行。
窗口(Windows)主要用于流處理(無限流),將流數(shù)據(jù)按照時間段或者大小拆成一個個的數(shù)據(jù)桶。窗口分為兩種:一種是根據(jù)key來統(tǒng)計(jì)的(Keyed Windows),一種是針對全部數(shù)據(jù)的(Non-Keyed Windows)。它的處理過程如下
Keyed Windows

stream
       .keyBy(...)                   <-  keyed versus non-keyed windows
       .window(...)                  <-  required: "assigner"
      [.trigger(...)]                <-  optional: "trigger" (else default trigger)
      [.evictor(...)]                <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]        <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)]     <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply() <- required: "function"
      [.getSideOutput(...)]          <-  optional: "output tag"

Non-Keyed Windows

stream
       .windowAll(...)               <-  required: "assigner"
      [.trigger(...)]                <-  optional: "trigger" (else default trigger)
      [.evictor(...)]                <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]        <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)]     <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply() <- required: "function"
      [.getSideOutput(...)]          <-  optional: "output tag"
窗口觸發(fā)可以有兩種條件,比方說達(dá)到了一定的數(shù)量或者水印(watermark)達(dá)到了條件。watermark是一種衡量Event Time進(jìn)展的機(jī)制,
watermark是用于處理亂序事件的,而正確的處理亂序事件,通常用watermark機(jī)制結(jié)合window來實(shí)現(xiàn)。
我們知道,流處理從事件產(chǎn)生,到流經(jīng)source,再到operator,中間是有一個過程和時間的。雖然大部分情況下,流到operator的數(shù)據(jù)都是按照事件產(chǎn)生的時間順序來的,但是也不排除由于網(wǎng)絡(luò)、背壓等原因,導(dǎo)致亂序的產(chǎn)生(out-of-order或者說late element)。
但是對于late element,我們又不能無限期的等下去,必須要有個機(jī)制來保證一個特定的時間后,必須觸發(fā)window去進(jìn)行計(jì)算了。這個特別的機(jī)制,就是watermark。
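周期性水印的核心計(jì)算其實(shí)很簡單:水印 = 目前見過的最大事件時間 - 允許的最大亂序時間。下面用純Java把這個邏輯單獨(dú)抽出來(僅為示意,不是Flink的API):

```java
public class WatermarkSketch {
    private long currentMaxTimestamp = 0L;
    private final long maxOutOfOrderness = 10000L; // 最大允許的亂序時間為10秒

    // 每來一條數(shù)據(jù),更新見過的最大事件時間
    public void onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(eventTimestamp, currentMaxTimestamp);
    }

    // 周期性調(diào)用,得到當(dāng)前水印
    public long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch();
        w.onEvent(1461756862000L); // 事件時間 2016-04-27 19:34:22.000
        System.out.println(w.currentWatermark()); // 1461756852000,即 19:34:12.000
    }
}
```

水印只增不減:即使后來到達(dá)的是遲到數(shù)據(jù)(事件時間更小),currentMaxTimestamp 也不會回退,這正是"等待亂序數(shù)據(jù)最多10秒"的含義。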
我們從socket接收數(shù)據(jù),經(jīng)過map后立刻抽取timestamp并生成watermark,之后應(yīng)用window,觀察watermark和event time如何變化,最終導(dǎo)致window被觸發(fā)。
public class WindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> input = env.socketTextStream("127.0.0.1", 9999);
        //將數(shù)據(jù)流(key,時間戳組成的字符串)轉(zhuǎn)換成元組
        SingleOutputStreamOperator<Tuple2<String, Long>> inputMap = input.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(",");
                return new Tuple2<>(splits[0], Long.parseLong(splits[1]));
            }
        });
        //提取時間戳,生成水印
        SingleOutputStreamOperator<Tuple2<String, Long>> watermarks = inputMap.setParallelism(1)
                .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {
            private Long currentMaxTimestamp = 0L;
            //最大允許的亂序時間為10秒
            private Long maxOutOfOrderness = 10000L;
            private Watermark watermark;
            private SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                return watermark;
            }

            @Override
            public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                Long timestamp = element.getField(1);
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                System.out.println("timestamp:" + element.getField(0) + "," + element.getField(1) + "|"
                        + format.format(new Date((Long) element.getField(1))) + ","
                        + currentMaxTimestamp + "|" + format.format(new Date(currentMaxTimestamp)) + ","
                        + watermark.toString() + "|" + format.format(new Date(watermark.getTimestamp())));
                return timestamp;
            }
        });
        //根據(jù)水印的條件,來執(zhí)行我們需要的方法
        //如果水印條件不滿足,該方法是永遠(yuǎn)不會執(zhí)行的
        watermarks.keyBy(x -> (String) x.getField(0)).timeWindow(Time.seconds(3))
                .apply(new WindowFunction<Tuple2<String, Long>, Tuple6<String, Integer, String, String, String, String>, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input,
                                      Collector<Tuple6<String, Integer, String, String, String, String>> out) throws Exception {
                        List<Tuple2<String, Long>> list = (List) input;
                        //將亂序進(jìn)行有序整理
                        List<Long> collect = list.stream().map(x -> (Long) x.getField(1)).sorted().collect(Collectors.toList());
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect(new Tuple6<>(key, list.size(),
                                format.format(collect.get(0)), format.format(collect.get(collect.size() - 1)),
                                format.format(window.getStart()), format.format(window.getEnd())));
                    }
                }).print().setParallelism(1);
        env.execute("WindowsApp");
    }
}
在控制臺執(zhí)行nc -lk 9999后,運(yùn)行我們的程序,控制臺輸入
000001,1461756862000
打印
timestamp:000001,1461756862000|2016-04-27 19:34:22.000,1461756862000|2016-04-27 19:34:22.000,Watermark @ -10000|1970-01-01 07:59:50.000
由該執(zhí)行結(jié)果watermark = -10000,我們可以看出,水印是先獲取的,再執(zhí)行時間戳的提取。
控制臺繼續(xù)輸入
000001,1461756866000
打印
timestamp:000001,1461756866000|2016-04-27 19:34:26.000,1461756866000|2016-04-27 19:34:26.000,Watermark @ 1461756852000|2016-04-27 19:34:12.000
由于水印是先獲取的,則此時的水印1461756852000|2016-04-27 19:34:12.000是第一次輸入所產(chǎn)生的。
Continue entering in the console:
000001,1461756872000
The program prints:
timestamp:000001,1461756872000|2016-04-27 19:34:32.000,1461756872000|2016-04-27 19:34:32.000,Watermark @ 1461756856000|2016-04-27 19:34:16.000
Our timestamp has now reached the 32nd second, 10 seconds beyond the first element's time.
Continue entering in the console:
000001,1461756873000
The program prints:
timestamp:000001,1461756873000|2016-04-27 19:34:33.000,1461756873000|2016-04-27 19:34:33.000,Watermark @ 1461756862000|2016-04-27 19:34:22.000
Our timestamp has now reached the 33rd second, 11 seconds beyond the first element's time, yet the window function has still not been triggered.
Continue entering in the console:
000001,1461756874000
The program prints:
timestamp:000001,1461756874000|2016-04-27 19:34:34.000,1461756874000|2016-04-27 19:34:34.000,Watermark @ 1461756863000|2016-04-27 19:34:23.000
(000001,1,2016-04-27 19:34:22.000,2016-04-27 19:34:22.000,2016-04-27 19:34:21.000,2016-04-27 19:34:24.000)
This time the window function was triggered and a six-element tuple was emitted.
Continue entering in the console:
000001,1461756876000
The program prints:
timestamp:000001,1461756876000|2016-04-27 19:34:36.000,1461756876000|2016-04-27 19:34:36.000,Watermark @ 1461756864000|2016-04-27 19:34:24.000
Here we can see that this watermark (2016-04-27 19:34:24.000) was actually produced by the previous element, and it lands exactly at the end of the window [2016-04-27 19:34:21.000, 2016-04-27 19:34:24.000) that contains 2016-04-27 19:34:22.000 — that is what triggered the window function.
So the trigger conditions are:
watermark time >= window_end_time
there is data within [window_start_time, window_end_time)
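These two conditions can be checked with a tiny plain-Java helper (windowFires and its parameters are illustrative names, not Flink API):

```java
// A window [windowStart, windowEnd) fires when the watermark has
// reached windowEnd and at least one element fell into the window.
class WindowTrigger {
    static boolean windowFires(long watermark, long windowEnd, int elementsInWindow) {
        return watermark >= windowEnd && elementsInWindow > 0;
    }
}
```

Using the example's numbers: the window ending at 1461756864000 (19:34:24.000) does not fire while the watermark is 1461756863000, and fires once the watermark reaches 1461756864000.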
Scala code
import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._

object WindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val input = env.socketTextStream("127.0.0.1", 9999)
    val inputMap = input.map(f => {
      val splits = f.split(",")
      (splits(0), splits(1).toLong)
    })
    val watermarks = inputMap.setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
      var currentMaxTimestamp = 0L
      var maxOutofOrderness = 10000L
      var watermark: Watermark = null
      val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

      override def getCurrentWatermark: Watermark = {
        watermark = new Watermark(currentMaxTimestamp - maxOutofOrderness)
        watermark
      }

      override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
        val timestamp = element._2
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp)
        println("timestamp:" + element._1 + "," + element._2 + "|" + format.format(element._2) + ","
          + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp) + ","
          + watermark.toString + "|" + format.format(watermark.getTimestamp))
        timestamp
      }
    })
    watermarks.keyBy(_._1).timeWindow(Time.seconds(3))
      .apply(new WindowFunction[(String, Long), (String, Int, String, String, String, String), String, TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)],
                           out: Collector[(String, Int, String, String, String, String)]): Unit = {
          val list = input.toList.sortBy(_._2)
          val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
          out.collect((key, input.size, format.format(list.head._2), format.format(list.last._2),
            format.format(window.getStart), format.format(window.getEnd)))
        }
      }).print().setParallelism(1)
    env.execute("WindowsApp")
  }
}
Tumbling Windows and Sliding Windows
A tumbling window is a non-overlapping time slice: every element that falls into the slice is processed by that window. The example above uses a tumbling window.
In code this can be written as
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
or abbreviated to
.timeWindow(Time.seconds(5))
A sliding window is a time slice that may overlap: the same element can fall into several windows, and each window processes the elements in its own slice.
In code this can be written as
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
or abbreviated to
.timeWindow(Time.seconds(10),Time.seconds(5))
public class SliderWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token, 1)));
            }
        }).keyBy(0).timeWindow(Time.seconds(10), Time.seconds(5))
                .sum(1).print().setParallelism(1);
        env.execute("SliderWindowsApp");
    }
}
Enter in the console:
a,b,c,d,e,f
a,b,c,d,e,f
a,b,c,d,e,f
Result:
(d,3)
(a,3)
(e,3)
(f,3)
(c,3)
(b,3)
(c,3)
(f,3)
(b,3)
(d,3)
(e,3)
(a,3)
From the result we can see that each element was counted twice.
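The double counting follows from window assignment: with a 10-second size and a 5-second slide, every timestamp falls into size/slide = 2 windows. A plain-Java sketch of that assignment (mirroring the way sliding event-time windows compute their start timestamps with offset 0; SlidingAssigner is an illustrative name, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

class SlidingAssigner {
    // Returns the start timestamps of all windows covering `timestamp`
    // for a sliding window of the given size and slide (offset 0).
    static List<Long> assignWindows(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide); // latest window containing the element
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

For example, an element at t = 12000 ms with size 10000 and slide 5000 belongs to the two windows starting at 10000 and 5000, so it contributes to both sums.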
Scala code
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object SliderWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9999)
    text.flatMap(_.split(","))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(10), Time.seconds(5))
      .sum(1)
      .print()
      .setParallelism(1)
    env.execute("SliderWindowsApp")
  }
}
Window Functions
ReduceFunction
This is an incremental function: it does not process all elements in a time window at once, but one element at a time.
public class ReduceWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap((FlatMapFunction<String, String>) (f, collector) -> {
            String[] splits = f.split(",");
            Stream.of(splits).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<Integer, Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(String value) throws Exception {
                        return new Tuple2<>(1, Integer.parseInt(value));
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .reduce((x, y) -> new Tuple2<>(x.getField(0), (int) x.getField(1) + (int) y.getField(1)))
                .print().setParallelism(1);
        env.execute("ReduceWindowsApp");
    }
}
Enter in the console (across two windows):
1,2,3,4,5
7,8,9
Result:
(1,15)
(1,24)
Scala code
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object ReduceWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9999)
    text.flatMap(_.split(","))
      .map(x => (1, x.toInt))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .reduce((x, y) => (x._1, x._2 + y._2))
      .print()
      .setParallelism(1)
    env.execute("ReduceWindowsApp")
  }
}
ProcessWindowFunction
This is a full-window function: it processes all the elements of a time window together.
public class ProcessWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap((FlatMapFunction<String, String>) (f, collector) -> {
            String[] splits = f.split(",");
            Stream.of(splits).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<Integer, Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(String value) throws Exception {
                        return new Tuple2<>(1, Integer.parseInt(value));
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .process(new ProcessWindowFunction<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void process(Tuple tuple, Context context, Iterable<Tuple2<Integer, Integer>> elements,
                                        Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        List<Tuple2<Integer, Integer>> list = (List) elements;
                        out.collect(list.stream().reduce((x, y) ->
                                new Tuple2<>(x.getField(0), (int) x.getField(1) + (int) y.getField(1)))
                                .get());
                    }
                }).print().setParallelism(1);
        env.execute("ProcessWindowsApp");
    }
}
Enter in the console (within one window):
1,2,3,4,5
7,8,9
Result:
(1,39)
Scala code
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._

object ProcessWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9999)
    text.flatMap(_.split(","))
      .map(x => (1, x.toInt))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .process(new ProcessWindowFunction[(Int, Int), (Int, Int), Tuple, TimeWindow] {
        override def process(key: Tuple, context: Context, elements: Iterable[(Int, Int)],
                             out: Collector[(Int, Int)]): Unit = {
          val list = elements.toList
          out.collect(list.reduce((x, y) => (x._1, x._2 + y._2)))
        }
      })
      .print()
      .setParallelism(1)
    env.execute("ProcessWindowsApp")
  }
}
Flink provides many built-in connectors for sources and sinks; at the moment they include:
Apache Kafka (source/sink)
Apache Cassandra (sink)
Amazon Kinesis Streams (source/sink)
Elasticsearch (sink)
Hadoop FileSystem (sink)
RabbitMQ (source/sink)
Apache NiFi (source/sink)
Twitter Streaming API (source)
HDFS Connector
This connector writes a data stream out to the Hadoop HDFS distributed file system. To use it, add the following dependencies:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
Choose the version numbers according to your own setup; my Hadoop version is
<hadoop.version>2.8.1</hadoop.version>
Create a new hdfssink folder under data; at this point the folder is empty.
public class FileSystemSinkApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        String filePath = "/Users/admin/Downloads/flink/data/hdfssink";
        BucketingSink<String> sink = new BucketingSink<>(filePath);
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HHmm"));
        sink.setWriter(new StringWriter<>());
        // sink.setBatchSize(1024 * 1024 * 400);
        sink.setBatchRolloverInterval(20);
        data.addSink(sink);
        env.execute("FileSystemSinkApp");
    }
}
I am not actually using Hadoop HDFS here but a local directory (for setting up HDFS, see the Hadoop hdfs+Spark configuration article). Type anything in the console:
adf dsdf wfdgg
We can see a new folder under hdfssink named
2021-01-15--0627
Entering that folder, we can see 3 files:
_part-4-0.pending
_part-5-0.pending
_part-6-0.pending
Viewing the three files:
(base) -bash-3.2$ cat _part-4-0.pending
adf
(base) -bash-3.2$ cat _part-5-0.pending
dsdf
(base) -bash-3.2$ cat _part-6-0.pending
wfdgg
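The bucket directory name 2021-01-15--0627 comes from the DateTimeBucketer pattern passed above. A minimal plain-Java sketch of that naming (bucketFor is an illustrative helper; UTC is pinned here only to make the result deterministic, whereas the bucketer itself uses the local time):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class BucketName {
    // Formats an epoch-millis timestamp with the same pattern the
    // DateTimeBucketer in the example was configured with.
    static String bucketFor(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd--HHmm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }
}
```

So records written within the same minute land in the same bucket directory, and a new directory is created when the formatted name changes.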
BucketingSink is in fact a subclass of the RichSinkFunction abstract class, just like the custom SinkToMySQL sink written earlier.
Scala code
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

object FileSystemSinkApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.socketTextStream("127.0.0.1", 9999)
    val filePath = "/Users/admin/Downloads/flink/data/hdfssink"
    val sink = new BucketingSink[String](filePath)
    sink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HHmm"))
    sink.setWriter(new StringWriter[String]())
    // sink.setBatchSize(1024 * 1024 * 400)
    sink.setBatchRolloverInterval(20)
    data.addSink(sink)
    env.execute("FileSystemSinkApp")
  }
}
Kafka Connector
To use the Kafka connector, Kafka must of course be installed first. Install zookeeper 3.4.5 and kafka 1.1.1.
Since my Kafka is installed on Alibaba Cloud, some configuration is needed for local access; modify server.properties in Kafka's config directory:
advertised.listeners=PLAINTEXT://external-IP:9092
host.name=internal-IP
Port 9092 also needs to be opened on Alibaba Cloud.
Start Kafka:
./kafka-server-start.sh -daemon ../config/server.properties
Create a topic:
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic pktest
Now we can see the topic inside zookeeper:
[zk: localhost:2181(CONNECTED) 2] ls /brokers/topics
[pktest, __consumer_offsets]
List the topics:
./kafka-topics.sh --list --zookeeper localhost:2181
This command returns:
[bin]# ./kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
pktest
Start a producer:
./kafka-console-producer.sh --broker-list localhost:9092 --topic pktest
But since we are running on Alibaba Cloud, the producer has to be started with:
./kafka-console-producer.sh --broker-list external-IP:9092 --topic pktest
Start a consumer:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic pktest
But since we are running on Alibaba Cloud, the consumer has to be started with:
./kafka-console-consumer.sh --bootstrap-server external-IP:9092 --topic pktest
Now whatever we type in the producer window shows up in the consumer window.
Kafka as a Source. Add the dependencies:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.1.1</version>
</dependency>
public class KafkaConnectorConsumerApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(4000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "external-IP:9092");
        properties.setProperty("group.id", "test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), properties));
        data.print().setParallelism(1);
        env.execute("KafkaConnectorConsumerApp");
    }
}
Result:
Input on the server:
[bin]# ./kafka-console-producer.sh --broker-list external-IP:9092 --topic pktest
>sdfa
The program prints:
sdfa
Scala code
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode

object KafkaConnectorConsumerApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(4000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(10000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "external-IP:9092")
    properties.setProperty("group.id", "test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.print().setParallelism(1)
    env.execute("KafkaConnectorConsumerApp")
  }
}
Kafka as a Sink
public class KafkaConnectorProducerApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(4000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "external-IP:9092");
        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(topic,
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), properties,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
        data.addSink(kafkaSink);
        env.execute("KafkaConnectorProducerApp");
    }
}
Enter in the console:
sdfae
dffe
The server prints:
[bin]# ./kafka-console-consumer.sh --bootstrap-server external-IP:9092 --topic pktest
sdfae
dffe
Scala code
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

object KafkaConnectorProducerApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(4000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(10000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    val data = env.socketTextStream("127.0.0.1", 9999)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "external-IP:9092")
    val kafkaSink = new FlinkKafkaProducer[String](topic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema), properties,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE)
    data.addSink(kafkaSink)
    env.execute("KafkaConnectorProducerApp")
  }
}
Standalone Deployment
Download: https://archive.apache.org/dist/flink/flink-1.7.2/flink-1.7.2-bin-hadoop28-scala_2.11.tgz
I am using 1.7.2 here (you can of course use another version). After downloading and extracting, enter the bin directory and run
./start-cluster.sh
Open the web UI at http://external-IP:8081/
Submit a test job:
nc -lk 9000
Leave the bin directory, go back to the parent directory, and run
./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
The running job can now be seen in the web UI.
Clicking the RUNNING button shows the details.
Although we type some characters into nc, for example
a f g r a d
a f g r a d
a f g r a d
they are not printed anywhere visible; to check the result we have to look in the log directory:
[log]# cat flink-root-taskexecutor-0-iZ7xvi8yoh0wspvk6rjp7cZ.out
a : 6
d : 3
r : 3
g : 3
f : 3
 : 90
Uploading our own jar
Before uploading, set the main class to be run:
<transformers>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
        <mainClass>com.guanjian.flink.java.test.StreamingJavaApp</mainClass>
    </transformer>
</transformers>
Since our code listens on port 9999, run
nc -lk 9999
After uploading (into a newly created test directory under flink), run
./bin/flink run test/flink-train-java-1.0.jar
Type into nc:
a d e g a d g f
Under log, run
cat flink-root-taskexecutor-0-iZ7xvi8yoh0wspvk6rjp7cZ.out
Besides the earlier records, several new ones have appeared:
a : 6
d : 3
r : 3
g : 3
f : 3
 : 90
(a,2)
(f,1)
(g,2)
(e,1)
(d,2)
Yarn Cluster Deployment
Yarn cluster deployment requires Hadoop to be installed first; my Hadoop version is 2.8.1.
Go to the etc/hadoop folder under the Hadoop installation directory.
First, as usual, configure JAVA_HOME in hadoop-env.sh:
export JAVA_HOME=/home/soft/jdk1.8.0_161
core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://host1:8020</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop2/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop2/tmp/dfs/data</value>
    </property>
</configuration>
These two directories need to be created now:
mkdir -p /opt/hadoop2/tmp/dfs/name
mkdir -p /opt/hadoop2/tmp/dfs/data
yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>host1</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
After starting, the HDFS page (port 50070) and the Yarn page (port 8088) should both be reachable.
Before deploying Flink on Yarn, configure HADOOP_HOME (and JAVA_HOME here as well):
vim /etc/profile
JAVA_HOME=/home/soft/jdk1.8.0_161
HADOOP_HOME=/home/soft/hadoop-2.8.1
JRE_HOME=${JAVA_HOME}/jre
CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:$PATH
export JAVA_HOME HADOOP_HOME PATH
After saving, run source /etc/profile.
First style of Yarn deployment
In flink's bin directory:
./yarn-session.sh -n 1 -jm 1024m -tm 1024m
-n: number of taskManagers
-jm: jobManager memory
-tm: taskManager memory
Now the session can be seen on Yarn's web page (port 8088).
After adding host1's IP to /etc/hosts on the client machine, click ApplicationMaster to reach Flink's management page.
Submitting a job
Upload a file to the HDFS root directory:
hdfs dfs -put LICENSE-2.0.txt /
Submit the job from flink's bin directory:
./flink run ../examples/batch/WordCount.jar -input hdfs://host1:8020/LICENSE-2.0.txt -output hdfs://host1:8020/wordcount-result.txt
When the computation finishes, inspect the HDFS files:
[bin]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   3 root supergroup      11358 2021-01-24 14:47 /LICENSE-2.0.txt
drwxr-xr-x   - root supergroup          0 2021-01-24 09:43 /abcd
drwxr-xr-x   - root supergroup          0 2021-01-24 14:17 /user
-rw-r--r--   3 root supergroup       4499 2021-01-24 15:06 /wordcount-result.txt
It can also be seen on the Flink page.
Second style of Yarn deployment
For the second style, the first deployment has to be torn down first:
yarn application -kill application_1611471412139_0001
In flink's bin directory:
./flink run -m yarn-cluster -yn 1 ../examples/batch/WordCount.jar
-m: the yarn cluster; yarn-cluster is a constant
-yn: number of taskManagers
Now it can also be seen in Yarn's web UI.
Submit our own job, changing the socket IP in the code to host1:
public class StreamingJavaApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("host1", 9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token, 1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5))
                .sum(1).print();
        env.execute("StreamingJavaApp");
    }
}
Start nc:
nc -lk 9999
./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar
Input:
[~]# nc -lk 9999
a v d t e a d e f
a v d t e a d e f
a v d t e a d e f
a v d t e a d e f
View the result
On Flink's web UI
State is the state of one specific Task/Operator.
State data is kept in the JVM by default.
Classification: Keyed State & Operator State
Keyed State
/**
 * Compute an average once for every two elements of a data set
 */
public class KeyedStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1, 3),
                new Tuple2<>(1, 5),
                new Tuple2<>(1, 7),
                new Tuple2<>(1, 4),
                new Tuple2<>(1, 2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    private transient ValueState<Tuple2<Integer, Integer>> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getState(new ValueStateDescriptor<>("avg",
                                TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Tuple2<Integer, Integer> tmpState = state.value();
                        Tuple2<Integer, Integer> currentState = tmpState == null ? Tuple2.of(0, 0) : tmpState;
                        Tuple2<Integer, Integer> newState = new Tuple2<>((int) currentState.getField(0) + 1,
                                (int) currentState.getField(1) + (int) value.getField(1));
                        state.update(newState);
                        if ((int) newState.getField(0) >= 2) {
                            out.collect(new Tuple2<>(value.getField(0), (int) newState.getField(1) / (int) newState.getField(0)));
                            state.clear();
                        }
                    }
                }).print().setParallelism(1);
        env.execute("KeyedStateApp");
    }
}
Result:
(1,4)
(1,5)
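The logic of the ValueState above can be followed with a plain-Java sketch (PairwiseAvg is an illustrative name): a (count, sum) pair is kept, and once two elements have arrived, their average is emitted and the pair is cleared — so the fifth element stays buffered and produces no output.

```java
import java.util.ArrayList;
import java.util.List;

class PairwiseAvg {
    // Mirrors the keyed ValueState bookkeeping: accumulate (count, sum),
    // emit the average every two elements, then clear the state.
    static List<Integer> averages(int[] values) {
        List<Integer> out = new ArrayList<>();
        int count = 0, sum = 0;
        for (int v : values) {
            count += 1;
            sum += v;
            if (count >= 2) {
                out.add(sum / count);
                count = 0;
                sum = 0;
            }
        }
        return out;
    }
}
```

For the input 3, 5, 7, 4, 2 this yields 4 and 5, matching the result above.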
Scala code
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import org.apache.flink.api.scala.createTypeInformation

object KeyedStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1, 3), (1, 5), (1, 7), (1, 4), (1, 2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int, Int), (Int, Int)] {
        var state: ValueState[(Int, Int)] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getState(new ValueStateDescriptor[(Int, Int)]("avg",
            createTypeInformation[(Int, Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          val tmpState = state.value()
          val currentState = if (tmpState != null) {
            tmpState
          } else {
            (0, 0)
          }
          val newState = (currentState._1 + 1, currentState._2 + value._2)
          state.update(newState)
          if (newState._1 >= 2) {
            out.collect((value._1, newState._2 / newState._1))
            state.clear()
          }
        }
      }).print().setParallelism(1)
    env.execute("KeyedStateApp")
  }
}
Result:
(1,4)
(1,5)
Reducing State
/**
 * Count the elements and sum their values
 */
public class ReducingStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1, 3),
                new Tuple2<>(1, 5),
                new Tuple2<>(1, 7),
                new Tuple2<>(1, 4),
                new Tuple2<>(1, 2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    private transient ReducingState<Tuple2<Integer, Integer>> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getReducingState(new ReducingStateDescriptor<>("sum",
                                new ReduceFunction<Tuple2<Integer, Integer>>() {
                                    @Override
                                    public Tuple2<Integer, Integer> reduce(Tuple2<Integer, Integer> value1,
                                                                           Tuple2<Integer, Integer> value2) throws Exception {
                                        Tuple2<Integer, Integer> tuple2 = new Tuple2<>((int) value1.getField(0) + 1,
                                                (int) value1.getField(1) + (int) value2.getField(1));
                                        return tuple2;
                                    }
                                }, TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Tuple2<Integer, Integer> tuple2 = new Tuple2<>(value.getField(0), value.getField(1));
                        state.add(tuple2);
                        out.collect(new Tuple2<>(state.get().getField(0), state.get().getField(1)));
                    }
                }).print().setParallelism(1);
        env.execute("ReducingStateApp");
    }
}
Result:
(2,8)
(3,15)
(4,19)
(5,21)
Scala代碼
import org.apache.flink.api.common.functions.{ReduceFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object ReducingStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1, 3), (1, 5), (1, 7), (1, 4), (1, 2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int, Int), (Int, Int)] {
        var state: ReducingState[(Int, Int)] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getReducingState(new ReducingStateDescriptor[(Int, Int)]("sum",
            new ReduceFunction[(Int, Int)] {
              override def reduce(value1: (Int, Int), value2: (Int, Int)): (Int, Int) = {
                (value1._1 + 1, value1._2 + value2._2)
              }
            }, createTypeInformation[(Int, Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          val tuple2 = (value._1, value._2)
          state.add(tuple2)
          out.collect((state.get()._1, state.get()._2))
        }
      }).print().setParallelism(1)
    env.execute("ReducingStateApp")
  }
}
Result:
(2,8)
(3,15)
(4,19)
(5,21)
List State
/**
 * Get the position of each element
 */
public class ListStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1, 3),
                new Tuple2<>(1, 5),
                new Tuple2<>(1, 7),
                new Tuple2<>(1, 4),
                new Tuple2<>(1, 2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer, Integer>, Tuple3<Integer, Integer, Integer>>() {
                    private transient ListState<Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getListState(new ListStateDescriptor<>("list",
                                TypeInformation.of(new TypeHint<Integer>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple3<Integer, Integer, Integer>> out) throws Exception {
                        state.add(value.getField(0));
                        Iterator<Integer> iterator = state.get().iterator();
                        Integer l = 0;
                        while (iterator.hasNext()) {
                            l += iterator.next();
                        }
                        Tuple3<Integer, Integer, Integer> tuple3 = new Tuple3<>(value.getField(0), value.getField(1), l);
                        out.collect(tuple3);
                    }
                }).print().setParallelism(1);
        env.execute("ListStateApp");
    }
}
Result:
(1,3,1)
(1,5,2)
(1,7,3)
(1,4,4)
(1,2,5)
Scala代碼
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object ListStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1, 3), (1, 5), (1, 7), (1, 4), (1, 2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int, Int), (Int, Int, Int)] {
        var state: ListState[Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getListState(new ListStateDescriptor[Int]("list", createTypeInformation[Int]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int, Int)]) = {
          state.add(value._1)
          val iterator = state.get().iterator()
          var l: Int = 0
          while (iterator.hasNext) {
            l += iterator.next()
          }
          val tuple3 = (value._1, value._2, l)
          out.collect(tuple3)
        }
      }).print().setParallelism(1)
    env.execute("ListStateApp")
  }
}
Result:
(1,3,1)
(1,5,2)
(1,7,3)
(1,4,4)
(1,2,5)
Fold State
/**
 * Count the elements, starting from an initial value
 */
public class FoldStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1, 3),
                new Tuple2<>(1, 5),
                new Tuple2<>(1, 7),
                new Tuple2<>(1, 4),
                new Tuple2<>(1, 2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer, Integer>, Tuple3<Integer, Integer, Integer>>() {
                    private transient FoldingState<Integer, Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getFoldingState(new FoldingStateDescriptor<Integer, Integer>("fold",
                                1,
                                (accumulator, value) -> accumulator + value,
                                TypeInformation.of(new TypeHint<Integer>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple3<Integer, Integer, Integer>> out) throws Exception {
                        state.add(value.getField(0));
                        out.collect(new Tuple3<>(value.getField(0), value.getField(1), state.get()));
                    }
                }).print().setParallelism(1);
        env.execute("FoldStateApp");
    }
}
Result:
(1,3,2)
(1,5,3)
(1,7,4)
(1,4,5)
(1,2,6)
Scala代碼
import org.apache.flink.api.common.functions.{FoldFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{FoldingState, FoldingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object FoldStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1, 3), (1, 5), (1, 7), (1, 4), (1, 2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int, Int), (Int, Int, Int)] {
        var state: FoldingState[Int, Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getFoldingState(new FoldingStateDescriptor[Int, Int]("fold",
            1, new FoldFunction[Int, Int] {
              override def fold(accumulator: Int, value: Int) = {
                accumulator + value
              }
            }, createTypeInformation[Int]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int, Int)]) = {
          state.add(value._1)
          out.collect((value._1, value._2, state.get()))
        }
      }).print().setParallelism(1)
    env.execute("FoldStateApp")
  }
}
Result:
(1,3,2)
(1,5,3)
(1,7,4)
(1,4,5)
(1,2,6)
Map State
/**
 * Add each element's value to the previous element's value; the first element keeps its own
 */
public class MapStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1, 3),
                new Tuple2<>(1, 5),
                new Tuple2<>(1, 7),
                new Tuple2<>(1, 4),
                new Tuple2<>(1, 2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer, Integer>, Tuple2<Integer, Integer>>() {
                    private transient MapState<Integer, Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getMapState(new MapStateDescriptor<>("map",
                                TypeInformation.of(new TypeHint<Integer>() {
                                }),
                                TypeInformation.of(new TypeHint<Integer>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Integer tmp = state.get(value.getField(0));
                        Integer current = tmp == null ? 0 : tmp;
                        state.put(value.getField(0), value.getField(1));
                        Tuple2<Integer, Integer> tuple2 = new Tuple2<>(value.getField(0), current + (int) value.getField(1));
                        out.collect(tuple2);
                    }
                }).print().setParallelism(1);
        env.execute("MapStateApp");
    }
}
Output:
(1,3)
(1,8)
(1,12)
(1,11)
(1,6)
Scala code
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import org.apache.flink.api.scala.createTypeInformation

object MapStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: MapState[Int,Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getMapState(new MapStateDescriptor[Int,Int]("map",
            createTypeInformation[Int], createTypeInformation[Int]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          val tmp: Int = state.get(value._1)
          val current: Int = if (tmp == null) {
            0
          } else {
            tmp
          }
          state.put(value._1, value._2)
          val tuple2 = (value._1, current + value._2)
          out.collect(tuple2)
        }
      }).print().setParallelism(1)
    env.execute("MapStateApp")
  }
}
Output:
(1,3)
(1,8)
(1,12)
(1,11)
(1,6)
Aggregating State
/**
 * Compute, for each record, the running average over this record and all previous records.
 */
public class AggregatingStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>() {
                    private transient AggregatingState<Integer,Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<>("agg",
                                new AggregateFunction<Integer, Tuple2<Integer,Integer>, Integer>() {
                                    @Override
                                    public Tuple2<Integer, Integer> createAccumulator() {
                                        return new Tuple2<>(0,0);
                                    }

                                    @Override
                                    public Tuple2<Integer, Integer> add(Integer value, Tuple2<Integer, Integer> accumulator) {
                                        return new Tuple2<>((int) accumulator.getField(0) + value,
                                                (int) accumulator.getField(1) + 1);
                                    }

                                    @Override
                                    public Integer getResult(Tuple2<Integer, Integer> accumulator) {
                                        return (int) accumulator.getField(0) / (int) accumulator.getField(1);
                                    }

                                    @Override
                                    public Tuple2<Integer, Integer> merge(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {
                                        return new Tuple2<>((int) a.getField(0) + (int) b.getField(0),
                                                (int) a.getField(1) + (int) b.getField(1));
                                    }
                                }, TypeInformation.of(new TypeHint<Tuple2<Integer,Integer>>() {})));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        state.add(value.getField(1));
                        Tuple2<Integer,Integer> tuple2 = new Tuple2<>(value.getField(0), state.get());
                        out.collect(tuple2);
                    }
                }).print().setParallelism(1);
        env.execute("AggregatingStateApp");
    }
}
Output:
(1,3)
(1,4)
(1,5)
(1,4)
(1,4)
Scala code
import org.apache.flink.api.common.functions.{AggregateFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{AggregatingState, AggregatingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object AggregatingStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: AggregatingState[Int,Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getAggregatingState(new AggregatingStateDescriptor[Int,(Int,Int),Int]("agg",
            new AggregateFunction[Int,(Int,Int),Int] {
              override def add(value: Int, accumulator: (Int, Int)) = {
                (accumulator._1 + value, accumulator._2 + 1)
              }

              override def createAccumulator() = {
                (0,0)
              }

              override def getResult(accumulator: (Int, Int)) = {
                accumulator._1 / accumulator._2
              }

              override def merge(a: (Int, Int), b: (Int, Int)) = {
                (a._1 + b._1, a._2 + b._2)
              }
            }, createTypeInformation[(Int,Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          state.add(value._2)
          val tuple2 = (value._1, state.get())
          out.collect(tuple2)
        }
      }).print().setParallelism(1)
    env.execute("AggregatingStateApp")
  }
}
Output:
(1,3)
(1,4)
(1,5)
(1,4)
(1,4)
Checkpoint Mechanism
Every operator in Flink can be stateful. To make that state fault-tolerant and persistent, Flink provides the checkpoint mechanism. A checkpoint captures both the state and the position consumed in the stream, giving the job a way to recover and continue as if no failure had occurred.
By default, checkpointing is disabled and must be enabled explicitly.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing, taking a checkpoint every 4 seconds
env.enableCheckpointing(4000);
// Set the checkpointing mode. EXACTLY_ONCE is the default and suits most applications;
// the alternative, CheckpointingMode.AT_LEAST_ONCE, is generally used in ultra-low-latency scenarios
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Set the checkpoint timeout, here 10 seconds
env.getCheckpointConfig().setCheckpointTimeout(10000);
// Set how many checkpoints may run concurrently; one or more
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            } else {
                return x;
            }
        }).print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}
Ordinarily, if we have not started nc -lk 9999, the program would simply die. But here, with checkpointing enabled, the job does not exit even though port 9999 is not open: it keeps trying to connect to it, because the default restart strategy that comes with checkpointing retries Integer.MAX_VALUE times. So we keep seeing logs like this:
java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction.run(SocketTextStreamFunction.java:96)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Scala code
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    val stream = env.socketTextStream("127.0.0.1", 9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}
Restart Strategies
As we just saw, if no restart strategy is set, checkpointing brings a default one: Integer.MAX_VALUE attempts with a 1-second delay. To restart only twice, we have to configure the restart strategy explicitly, either in Flink's configuration file or in code.
For example, edit flink-conf.yaml in Flink's conf directory:
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 5 s
That is, 2 restart attempts after a failure, with a 5-second delay between them.
Or set it in code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
Note that a restart strategy must be used together with checkpointing; without checkpointing enabled it has no effect.
public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            } else {
                return x;
            }
        }).print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}
Now start nc -lk 9999 and run the program. When we type pk twice into the console, the program throws an exception
java.lang.RuntimeException: 出bug了...
	at com.guanjian.flink.java.test.CheckpointApp.lambda$main$95f17bfa$1(CheckpointApp.java:18)
	at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
	at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
	at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
but does not die. The third time we type pk, the program dies for good.
Scala code
import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("127.0.0.1", 9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}
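The fixed-delay restart semantics we just observed can be sketched in plain Java, independent of Flink (runWithFixedDelay and the failing task below are illustrative helpers, not Flink API): the initial run plus two restarts gives three chances in total, and a third consecutive failure is fatal.

```java
public class FixedDelayRestartSketch {

    /**
     * Run task; on failure, restart it up to `restarts` more times, sleeping
     * delayMillis between attempts. Returns the number of executions on success;
     * rethrows the failure once the restart budget is used up.
     */
    public static int runWithFixedDelay(Runnable task, int restarts, long delayMillis) {
        int executions = 0;
        while (true) {
            executions++;
            try {
                task.run();
                return executions;
            } catch (RuntimeException e) {
                if (executions > restarts) {
                    throw e; // restart budget exhausted: the job dies for good
                }
                try {
                    Thread.sleep(delayMillis); // the fixed delay between restarts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Fails twice (like typing "pk" twice), then behaves: 2 restarts are enough.
        int[] failures = {0};
        int executions = runWithFixedDelay(
                () -> { if (++failures[0] <= 2) throw new RuntimeException("出bug了..."); },
                2, 10);
        System.out.println(executions); // 1 initial run + 2 restarts = 3
    }
}
```

A task that fails a third time while the budget allows only 2 restarts rethrows, which is exactly the point at which the Flink job above dies.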
StateBackend
By default, checkpoint state is kept in memory, so once the program dies and is restarted, all previous state is lost. Say, for example, that we had previously typed into nc:
a,a,a
Take the earlier CheckpointApp with a small modification:
public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            } else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token, 1)));
            }
        }).keyBy(0).sum(1)
          .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}
Output:
(a,1)
(a,2)
(a,3)
That works fine. But once the program dies and is started again, feeding it the same input produces the same result as before.
If that is not what we want, and the result we expect is
(a,4) (a,5) (a,6)
that is, the results from before the crash are retained and accumulation continues on top of them:
public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        // Retain checkpoints externally (outside TaskManager memory)
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Store state as files on a file system
        env.setStateBackend(new FsStateBackend("hdfs://172.18.114.236:8020/backend"));
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("host1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            } else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token, 1)));
            }
        }).keyBy(0).sum(1)
          .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}
Adjust the main class to run in the pom:
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
    <mainClass>com.guanjian.flink.java.test.CheckpointApp</mainClass>
</transformer>
Package the jar and upload it to the test directory under Flink on the server.
Then edit flink-conf.yaml in Flink's conf directory and add the following:
state.backend: filesystem
state.checkpoints.dir: hdfs://172.18.114.236:8020/backend
state.savepoints.dir: hdfs://172.18.114.236:8020/backend
Create the backend directory in HDFS:
hdfs dfs -mkdir /backend
Restart Flink and start
nc -lk 9999
The first submission is done the same way as before:
./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar
Continue with the earlier input:
a,a,a
Now stop the submitted Flink job; in HDFS you will find a directory whose name is a long string of digits.
Start the program again, but this time slightly differently:
./flink run -s hdfs://172.18.114.236:8020/backend/4db93b564e17b3806230f7c2d053121e/chk-5 -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar
Then keep typing into nc:
a,a,a
The result now matches our expectation.
Scala code
import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.setStateBackend(new FsStateBackend("hdfs://172.18.114.236:8020/backend"))
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("host1", 9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).flatMap(_.split(","))
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}
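What a persistent state backend buys us can be sketched in plain Java (a simplified illustration of the idea only; the file format and method names below are made up, not Flink's actual checkpoint mechanism): state is periodically written to durable storage, and a restarted job first restores that state and keeps counting from it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateBackendSketch {

    /** "Checkpoint": persist the word counts to a backend file. */
    public static void checkpoint(Map<String, Integer> counts, Path backend) {
        try {
            List<String> lines = new ArrayList<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                lines.add(e.getKey() + "=" + e.getValue());
            }
            Files.write(backend, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Restore": load the counts back; an empty map if no checkpoint exists yet. */
    public static Map<String, Integer> restore(Path backend) {
        Map<String, Integer> counts = new HashMap<>();
        try {
            if (Files.exists(backend)) {
                for (String line : Files.readAllLines(backend)) {
                    String[] kv = line.split("=");
                    counts.put(kv[0], Integer.parseInt(kv[1]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /** The word count itself, merging new tokens into (possibly restored) state. */
    public static Map<String, Integer> count(Map<String, Integer> state, String input) {
        for (String token : input.split(",")) {
            state.merge(token, 1, Integer::sum);
        }
        return state;
    }

    /** End-to-end demo: count, checkpoint, "restart", count again. */
    public static int demo() {
        try {
            Path backend = Files.createTempFile("backend", ".ckpt");
            Files.delete(backend); // no checkpoint exists yet on the first run
            Map<String, Integer> before = count(restore(backend), "a,a,a"); // (a,3)
            checkpoint(before, backend);
            Map<String, Integer> after = count(restore(backend), "a,a,a"); // restored, continues: (a,6)
            return after.get("a");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

demo() counts "a,a,a", checkpoints, simulates a restart by restoring from the file, counts "a,a,a" again, and ends at a=6 rather than starting over at 3: the same behavior the FsStateBackend gives the Flink job above.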
RocksDBStateBackend
To use the RocksDBStateBackend, first add the dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        // Retain checkpoints externally (outside TaskManager memory)
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Store state in a RocksDB database and flush it to HDFS
        env.setStateBackend(new RocksDBStateBackend("hdfs://172.18.114.236:8020/backend/rocksDB", true));
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("host1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            } else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token, 1)));
            }
        }).keyBy(0).sum(1)
          .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}
Package the jar and upload it to Flink's test directory on the server.
Create the HDFS directory:
hdfs dfs -mkdir /backend/rocksDB
In flink-conf.yaml, modify and add the following:
state.backend: rocksdb
state.checkpoints.dir: hdfs://172.18.114.236:8020/backend/rocksDB
state.savepoints.dir: hdfs://172.18.114.236:8020/backend/rocksDB
state.backend.incremental: true
state.backend.rocksdb.checkpoint.transfer.thread.num: 1
state.backend.rocksdb.localdir: /raid/db/flink/checkpoints
state.backend.rocksdb.timer-service.factory: HEAP
Restart Flink and run
nc -lk 9999
The first submission is unchanged:
./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar
Continue with the earlier input:
a,a,a
Now stop the submitted Flink job; again a digit-named directory appears in HDFS.
On one of the cluster machines — only "one of", not necessarily the machine the job was submitted from — you can see RocksDB's local data files.
The RocksDB backend writes its data there first, then flushes it to HDFS.
Start the program again:
./flink run -s hdfs://172.18.114.236:8020/backend/rocksDB/6277c8adfba91c72baa384a0d23581d9/chk-64 -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar
Then type:
a,a,a
Observing the output, the result is the same as before.
Scala code
import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.setStateBackend(new RocksDBStateBackend("hdfs://172.18.114.236:8020/backend/rocksDB", true))
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("host1", 9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).flatMap(_.split(","))
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}
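The local-then-remote flow described above can be sketched with plain files (an illustrative simplification — the real backend ships RocksDB's own files, incrementally when state.backend.incremental is true): writes hit fast local disk on the hot path, and a checkpoint copies the local data to durable remote storage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class LocalThenRemoteSketch {

    /** Hot path: append a state update to fast local storage (RocksDB's role). */
    public static void writeLocal(Path local, String record) {
        try {
            Files.write(local, Arrays.asList(record),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Checkpoint time: copy the local data to durable remote storage (HDFS's role). */
    public static void flushToRemote(Path local, Path remote) {
        try {
            Files.copy(local, remote, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo: two local writes, one flush; the remote copy then mirrors the local one. */
    public static int demo() {
        try {
            Path local = Files.createTempFile("rocksdb-local", ".log");
            Path remote = Files.createTempFile("hdfs-remote", ".log");
            writeLocal(local, "a=1");
            writeLocal(local, "a=2");
            flushToRemote(local, remote);
            return Files.readAllLines(remote).size(); // both records reached "remote" storage
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The split explains why the local files are found on only one cluster machine: they are the hot working copy, while HDFS holds the durable checkpoint.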
HistoryServer
The HistoryServer is used to inspect information about jobs that have already finished running.
Edit flink-conf.yaml in Flink's conf directory and add the following:
jobmanager.archive.fs.dir: hdfs://172.18.114.236:8020/completed-jobs/
historyserver.web.address: 0.0.0.0
historyserver.web.port: 8082
historyserver.archive.fs.dir: hdfs://172.18.114.236:8020/completed-jobs/
historyserver.archive.fs.refresh-interval: 10000
Then run, from the bin directory:
./historyserver.sh start
Visit external-ip:8082 in a browser and you will see a web UI (there is nothing here on first visit; what I show is left over after running a job).
As usual, run a job; after it finishes, you can see its information here.
The information retained for the job can also be seen in HDFS.
The information is also exposed through a REST API that returns JSON — for example, an endpoint such as /jobs/overview lists the jobs the HistoryServer has archived.