spark1.4发布,支持了窗口分析函数(window functions)。在离线平台中,90%以上的离线分析任务都是使用hive实现,其中必然会使用很多窗口分析函数,如果sparksql支持窗口分析函数,
那么对于后面hive向sparksql中的迁移的工作量会大大降低,使用方式如下:
1、初始化数据
创建表
- create table window_test2 (url string, rate int) row format delimited fields terminated by ',';
准备测试数据
- url1,12
- url2,11
- url1,23
- url2,25
- url1,58
- url3,11
- url2,25
- url3,58
- url2,11
加载数据:
- load data local inpath '/opt/bin/short_opt/windows2.data' overwrite into table window_test2 ;
2、窗口函数测试
查询所有数据
- select * from window_test2;
------- ------- | url | rate | ------- ------- | url1 | 12 || url2 | 11 || url1 | 23 || url2 | 25 || url1 | 58 || url3 | 11 || url2 | 25 || url3 | 58 || url2 | 11 | ------- ------- 分组排序:
- select url,rate,row_number() over(partition by url order by rate desc) as r from window_test2;
------- ------- ---- | url | rate | r | ------- ------- ---- | url1 | 58 | 1 || url1 | 23 | 2 || url1 | 12 | 3 || url2 | 25 | 1 || url2 | 25 | 2 || url2 | 11 | 3 || url2 | 11 | 4 || url3 | 58 | 1 || url3 | 11 | 2 | ------- ------- ---- 分组统计sum
- select url,rate,sum(rate) over(partition by url ) as r from window_test2;
------- ------- ----- | url | rate | r | ------- ------- ----- | url1 | 12 | 93 || url1 | 23 | 93 || url1 | 58 | 93 || url2 | 11 | 72 || url2 | 25 | 72 || url2 | 25 | 72 || url2 | 11 | 72 || url3 | 11 | 69 || url3 | 58 | 69 | ------- ------- ----- 分组统计avg
- select url,rate,avg(rate) over(partition by url ) as r from window_test2;
------- ------- ------- | url | rate | r | ------- ------- ------- | url1 | 12 | 31.0 || url1 | 23 | 31.0 || url1 | 58 | 31.0 || url2 | 25 | 18.0 || url2 | 11 | 18.0 || url2 | 11 | 18.0 || url2 | 25 | 18.0 || url3 | 11 | 34.5 || url3 | 58 | 34.5 | ------- ------- ------- 分组统计count
- select url,rate,count(rate) over(partition by url ) as r from window_test2;
------- ------- ---- | url | rate | r | ------- ------- ---- | url1 | 12 | 3 || url1 | 23 | 3 || url1 | 58 | 3 || url2 | 11 | 4 || url2 | 25 | 4 || url2 | 25 | 4 || url2 | 11 | 4 || url3 | 11 | 2 || url3 | 58 | 2 | ------- ------- ---- 分组lag
- select url,rate,lag(rate) over(partition by url ) as r from window_test2;
------- ------- ------- | url | rate | r | ------- ------- ------- | url1 | 12 | null || url1 | 23 | 12 || url1 | 58 | 23 || url2 | 25 | null || url2 | 11 | 25 || url2 | 11 | 11 || url2 | 25 | 11 || url3 | 11 | null || url3 | 58 | 11 | ------- ------- -------
3、spark-1.4以后,支持所有的窗口函数了,有利用于hive作业向spark-sql来转换。
posted on 2017-10-23 22:04
xzc 阅读(587)
评论(0) 编辑 收藏 所属分类:
hadoop