本文共 4373 字,大约阅读时间需要 14 分钟。
大搜车业务线众多,对于数据的需求也各种各样,本文将介绍其中之一的大搜车车商客户实时数据需求,例如车商PC|H5
端店铺、车辆、分享等实时流量数据报表;随着数据量级的增长,目前数据量级在亿级以上,原有以mysql提供查询服务不再适合此场景,经过多方面的考虑,存储最终选择Aliyun HBase,同时为了几乎0成本的切换,采用Phoenix On HBase Sql中间件,它管理着HBase的二级索引并且它对sql的支持友好,本文也将介绍Phoenix和HBase结合场景下的压力测试。
PC|H5
端访问日志,通过Flume收集这些日志,并按照业务场景需求,通过流试计算清洗、过滤、统计,使用Phoenix api实时推送到HBase。统一通过数据网关提供业务查询。
*架构图:*![图片](https://raw.githubusercontent.com/Danier-Evens/Markdown_Image/master/image/%E7%B3%BB%E7%BB%9F%E6%9E%B6%E6%9E%84.png)
数据准备
考虑到是一次性的工作,本次压测数据我采用Phoenix提供的脚本的方式导入数据。
数据表、sql模板、索引建立
表建立
CREATE TABLE FLOW.SHOP_DATA_BY_SALER_CAR_V2 ( PK varchar primary key, INFO.DATE_STR BIGINT, INFO.STORE_ID VARCHAR, INFO.CAR_ID VARCHAR, INFO.SELLER_ID VARCHAR, INFO.SHARE_PV INTEGER, INFO.SHARE_UV INTEGER, INFO.FLOW_PV INTEGER, INFO.FLOW_UV INTEGER, INFO.CALL_PV INTEGER, INFO.CALL_UV INTEGER, INFO.APPOINT_PV INTEGER, INFO.APPOINT_UV INTEGER, INFO.LAST_UPDATE_TIME DATE) COMPRESSION='SNAPPY',DATA_BLOCK_ENCODING='DIFF';
模拟真实查询sql模板,sql查询时间范围为1个月的数据。
SELECT info.seller_id, sum(info.share_pv) as sum_share_pv, sum(info.flow_pv) as sum_flow_pv, sum(info.call_pv) as sum_call_pv, sum(info.appoint_pv) as sum_appoint_pvFROM FLOW.SHOP_DATA_BY_SALER_CAR_V2WHERE info.store_id = '%s' AND info.date >= %d AND info.date <= %dGROUP BY info.seller_idORDER BY sum_share_pv DESC
SELECT sum(info.share_pv) as sum_share_pv, sum(info.flow_pv) as sum_flow_pv, sum(info.call_pv) as sum_call_pv, sum(info.appoint_pv) as sum_appoint_pvFROM FLOW.SHOP_DATA_BY_SALER_CAR_V2WHERE info.store_id = '%s' AND info.date >= %d AND info.date <= %d
SELECT sum(info.share_pv) as sum_share_pv, sum(info.flow_pv) as sum_flow_pv, sum(info.call_pv) as sum_call_pv, sum(info.appoint_pv) as sum_appoint_pvFROM FLOW.SHOP_DATA_BY_SALER_CAR_V2WHERE info.store_id = '%s' AND info.date >= %d AND info.date <= %d AND info.seller_id = '%s'
SELECT info.seller_id, sum(info.share_pv) sum_share_pv, sum(info.flow_pv) sum_flow_pvFROM FLOW.SHOP_DATA_BY_SALER_CAR_V2WHERE info.car_id = '%s' AND info.date >= %d AND info.date <= %d AND info.share_pv <> 0ORDER BY info.seller_id
针对sql模板场景,建立索引表,索引类为覆盖索引
CREATE INDEX SHOP_DATA_BY_SALER_CAR_V2_INDEX ON FLOW.SHOP_DATA_BY_SALER_CAR_V2 (INFO.STORE_ID, INFO.DATE_STR) INCLUDE (INFO.SELLER_ID, INFO.CAR_ID, INFO.SHARE_PV, INFO.FLOW_PV, INFO.CALL_PV, INFO.APPOINT_PV) COMPRESSION='SNAPPY',DATA_BLOCK_ENCODING='DIFF'
CREATE INDEX SHOP_DATA_BY_SALER_CAR_V2_INDEX1 ON FLOW.SHOP_DATA_BY_SALER_CAR_V2 (INFO.CAR_ID, INFO.DATE_STR, INFO.SHARE_PV) INCLUDE (INFO.SELLER_ID, INFO.FLOW_PV, INFO.CALL_PV, INFO.APPOINT_PV) COMPRESSION='SNAPPY',DATA_BLOCK_ENCODING='DIFF'
数据样例的选择:sql查询时间范围均为1个月,查询条件由挑选出这1个月中按车商、销售、车辆各个分组总条数在前300、300、300的数据按照模板随机组合查询。保证sql查询都能命中数据,同时也排除每次都是量很大的数据。数据样例见最后。测试表的数据量级在亿行以上。
系统情况
压测分别从10 ~ 100并发之前压测,以10为累加单位进行压测,压测时间为10分钟。目前我们业务场景每秒并发数在50 ~ 80左右,高峰期高于80。
以下挑选了并发数在100的时候应用GC、HBase系统负载情况。
数据样例, ps:数据集经过特殊处理。
车商数据集
010140548,001106040,001109847,001104443,001106241,001101049,001110943,001131047,001119549,001121749,001102043,001142748,001118444,001108248,001111340,001108240,001151942 .......
车辆数据集
Cg6f7hXsfkqffgWHsafk,adfu3fMhZsfkDffBFsa4kNadf,fcfchesfkdfg15saakcadf6d487b834e2b8eb81217c,72f8hOsfkRffgdIsaQkMadf,75fah8sfk3fg8fsa4kdadf7c4e03b9e46a864b858b6,59f0hesfk5ffg96sadk8adf934d72b4f8119f0e38acb .......
销售数据集
11234,11791,18782,13298,15889,13069,13213,18231,18988,18346,18946,13137,15051,15320,15680,15066,15512,13585,15555,13235,18195,13888,13363,13921,17777,18088,13188,15708 .....
转载地址:http://uizsl.baihongyu.com/