spark 累加历史统计全部行转列

spark 累加历史主要用到了窗口函数，而进行全部统计，则需要用到rollup函数

1 应用场景：

　　1、我们需要统计用户的总使用时长（累加历史）

　　2、前台展现页面需要对多个维度进行查询，如：产品、地区等等

　　3、需要展现的表格头如：产品、2015-04、2015-05、2015-06

2 原始数据：

product_code |event_date |duration |
-------------|-----------|---------|
1438         |2016-05-13 |165      |
1438         |2016-05-14 |595      |
1438         |2016-05-15 |105      |
1629         |2016-05-13 |12340    |
1629         |2016-05-14 |13850    |
1629         |2016-05-15 |227      |

3 业务场景实现

3.1 业务场景1：累加历史：

如数据源所示：我们已经有当天用户的使用时长，我们期望在进行统计的时候，14号能累加13号的，15号能累加14、13号的，以此类推

3.1.1 spark-sql实现

//spark sql 使用窗口函数累加历史数据
sqlcontext.sql(
"""
  select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
  from userlogs_date
""").show
 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         760|
| 1438|2016-05-15|         865|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       26190|
| 1629|2016-05-15|       26417|
 ----- ---------- ------------

3.1.2 dataframe实现

//使用column提供的over 函数，传入窗口操作
import org.apache.spark.sql.expressions._
val first_2_now_window = window.partitionby("pcode").orderby("event_date")
df_userlogs_date.select(
    $"pcode",
    $"event_date",
    sum($"duration").over(first_2_now_window).as("sum_duration")
).show
 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         760|
| 1438|2016-05-15|         865|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       26190|
| 1629|2016-05-15|       26417|
 ----- ---------- ------------

3.1.3 扩展累加一段时间范围内

实际业务中的累加逻辑远比上面复杂，比如，累加之前n天，累加前n天到后n天等等。以下我们来实现：

　3.1.3.1 累加历史所有：

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
window.partitionby("pcode").orderby("event_date").rowsbetween(long.minvalue,0)
window.partitionby("pcode").orderby("event_date")
上边四种写法完全相等

3.1.3.2 累加n天之前，假设n=3

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
window.partitionby("pcode").orderby("event_date").rowsbetween(-3,0)

3.1.3.3 累加前n天，后m天: 假设n=3 m=5

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
window.partitionby("pcode").orderby("event_date").rowsbetween(-3,5)
3.1.3.4 累加该分区内所有行

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date

window.partitionby("pcode").orderby("event_date").rowsbetween(long.minvalue,long.maxvalue)

总结如下：

preceding：用于累加前n行（分区之内）。若是从分区第一行头开始，则为 unbounded。 n为：相对当前行向前的偏移量
following ：与preceding相反，累加后n行（分区之内）。若是累加到该分区结束，则为 unbounded。n为：相对当前行向后的偏移量
current row：顾名思义，当前行，偏移量为0

说明：上边的前n，后m，以及current row均会累加该偏移量所在行

3.1.3.4 实测结果

累加历史：分区内当天及之前所有 写法1：select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date

 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         760|
| 1438|2016-05-15|         865|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       26190|
| 1629|2016-05-15|       26417|
 ----- ---------- ------------

累加历史：分区内当天及之前所有 写法2：select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date

 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         760|
| 1438|2016-05-15|         865|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       26190|
| 1629|2016-05-15|       26417|
 ----- ---------- ------------

累加当日和昨天：select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date

 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         760|
| 1438|2016-05-15|         700|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       26190|
| 1629|2016-05-15|       14077|
 ----- ---------- ------------

累加当日、昨日、明日：select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following ) as sum_duration from userlogs_date

 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         760|
| 1438|2016-05-14|         865|
| 1438|2016-05-15|         700|
| 1629|2016-05-13|       26190|
| 1629|2016-05-14|       26417|
| 1629|2016-05-15|       14077|
 ----- ---------- ------------

累加分区内所有：当天和之前之后所有：select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date

 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| 1438|2016-05-13|         865|
| 1438|2016-05-14|         865|
| 1438|2016-05-15|         865|
| 1629|2016-05-13|       26417|
| 1629|2016-05-14|       26417|
| 1629|2016-05-15|       26417|
 ----- ---------- ------------

3.2 业务场景2：统计全部

3.2.1 spark sql实现

//spark sql 使用rollup添加all统计
sqlcontext.sql(
"""
  select pcode,event_date,sum(duration) as sum_duration
  from userlogs_date_1
  group by pcode,event_date with rollup
  order by pcode,event_date
""").show()
 ----- ---------- ------------                                                  
|pcode|event_date|sum_duration|
 ----- ---------- ------------ 
| null|      null|       27282|
| 1438|      null|         865|
| 1438|2016-05-13|         165|
| 1438|2016-05-14|         595|
| 1438|2016-05-15|         105|
| 1629|      null|       26417|
| 1629|2016-05-13|       12340|
| 1629|2016-05-14|       13850|
| 1629|2016-05-15|         227|
 ----- ---------- ------------

3.2.2 dataframe函数实现

//使用dataframe提供的rollup函数，进行多维度all统计
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderby($"pcode", $"event_date")
 ----- ---------- -------------                                                 
|pcode|event_date|sum(duration)|
 ----- ---------- ------------- 
| null|      null|        27282|
| 1438|      null|          865|
| 1438|2016-05-13|          165|
| 1438|2016-05-14|          595|
| 1438|2016-05-15|          105|
| 1629|      null|        26417|
| 1629|2016-05-13|        12340|
| 1629|2016-05-14|        13850|
| 1629|2016-05-15|          227|
 ----- ---------- -------------

3.3 行转列 ->pivot

pivot目前还没有sql语法，先用df语法吧

val userlogs_date_all = sqlcontext.sql("select dcode, pcode,event_date,sum(duration) as duration from userlogs group by dognum, pcode,event_date ")
userlogs_date_all.registertemptable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getas[string]("event_date")).distinct().collect().tolist
userlogs_date_all.groupby($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show
 ----------------- ----- ---------- ---------- ---------- ---------- 
|            dcode|pcode|2016-05-26|2016-05-13|2016-05-14|2016-05-15|
 ----------------- ----- ---------- ---------- ---------- ---------- 
|         f2429186| 1438|         0|         0|       227|         0|
|        ai2342441| 1438|         0|         0|         0|       345|
|       a320018711| 1438|         0|       939|         0|         0|
|         h2635817| 1438|         0|       522|         0|         0|
|         d0288196| 1438|         0|       101|         0|         0|
|         y0242218| 1438|         0|      1036|         0|         0|
|         h2392574| 1438|         0|         0|       689|         0|
|         d2245588| 1438|         0|         0|         1|         0|
|         y2514906| 1438|         0|         0|       118|         4|
|         h2540419| 1438|         0|       465|       242|         5|
|         r2231926| 1438|         0|         0|       305|         0|
|         h2684591| 1438|         0|       136|         0|         0|
|         a2548470| 1438|         0|       412|         0|         0|
|         gh000309| 1438|         0|         0|         0|         4|
|         h2293216| 1438|         0|         0|         0|       534|
|         r2170601| 1438|         0|         0|         0|         0|
|b2365238;b2559538| 1438|         0|         0|         0|         0|
|         bq005465| 1438|         0|         0|       642|        78|
|        ah2180324| 1438|         0|       608|       146|        36|
|         h0279306| 1438|         0|       490|         0|         0|
 ----------------- ----- ---------- ---------- ---------- ----------

附录

下面是这两个函数的官方api说明：

org.apache.spark.sql.scala

def rollup(col1: string, cols: string*): groupeddata

create a multi-dimensional rollup for the current dataframe using the specified columns, so we can run aggregation on them. see groupeddata for all the available aggregate functions.
this is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()
// compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(map(
  "salary" -> "avg",
  "age" -> "max"
))

def rollup(cols: column*): groupeddata
create a multi-dimensional rollup for the current dataframe using the specified columns, so we can run aggregation on them. see groupeddata for all the available aggregate functions.
df.rollup($"department", $"group").avg()
// compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(map(
  "salary" -> "avg",
  "age" -> "max"
))

org.apache.spark.sql.column.scala

def over(window: windowspec): column
define a windowing column.
val w = window.partitionby("name").orderby("id")
df.select(
  sum("price").over(w.rangebetween(long.minvalue, 2)),
  avg("price").over(w.rowsbetween(0, 4))
)

posted on 2017-10-23 22:05 xzc 阅读(791) 评论(0) 编辑收藏所属分类: hadoop

常用链接

留言簿(12)

spark 累加历史统计全部行转列 -凯发k8网页登录

搜索

最新评论

阅读排行榜

评论排行榜


只有注册用户后才能发表评论。




网站导航:
相关文章: hdfs副本设置——默认3 kafka节点假死 hive中reduce个数设定 spark 累加历史统计全部行转列 spark分析窗口函数 sparksql相关语句总结