• TiSpark Quick Start Guide
    • Deployment Information
    • Environment Preparation
      • Install JDK on the TiDB Instance
      • Import Sample Data
    • Usage Example

    TiSpark Quick Start Guide

    To make it easy to try out TiSpark, TiDB clusters (Pre-GA or master version) deployed via TiDB-Ansible integrate Spark, the TiSpark jar package, and the TiSpark sample data by default.

    Deployment Information

    • Spark is deployed by default in the spark directory under the TiDB instance deployment directory.
    • The TiSpark jar package is deployed by default in the jars folder under the Spark deployment directory:

      spark/jars/tispark-0.1.0-beta-SNAPSHOT-jar-with-dependencies.jar

    • The TiSpark sample data and its import script are deployed by default under the TiDB-Ansible directory:

      tidb-ansible/resources/bin/tispark-sample-data

    Environment Preparation

    Install JDK on the TiDB Instance

    Download the current latest release of JDK 1.8 from the Oracle JDK official download page. The version used in this example is jdk-8u144-linux-x64.tar.gz.

    Extract the package, then set the environment variables according to your JDK deployment directory by editing the ~/.bashrc file, for example:

      export JAVA_HOME=/home/pingcap/jdk1.8.0_144
      export PATH=$JAVA_HOME/bin:$PATH

    Verify that the JDK works:

      $ java -version
      java version "1.8.0_144"
      Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
      Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

    Import Sample Data

    Assume the TiDB cluster is started, with one TiDB instance serving at IP 192.168.0.2, port 4000, user name root, and an empty password.

      cd tidb-ansible/resources/bin/tispark-sample-data

    Modify the TiDB login information in sample_data.sh, for example:

      mysql -h 192.168.0.2 -P 4000 -u root < dss.ddl

    Run the script:

      ./sample_data.sh

    The machine that runs the script must have the MySQL client installed; CentOS users can install it with yum -y install mysql.
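    The login edit above can also be parameterized rather than hard-coded. A minimal sketch, assuming the example connection values from this guide (the variable names are hypothetical and not part of sample_data.sh):

```shell
# Hypothetical variables for the TiDB connection used by the import;
# adjust them to match one of your TiDB instances.
TIDB_HOST=192.168.0.2
TIDB_PORT=4000
TIDB_USER=root

# The import command then becomes (an empty password needs no -p flag):
MYSQL_CMD="mysql -h ${TIDB_HOST} -P ${TIDB_PORT} -u ${TIDB_USER}"
echo "${MYSQL_CMD} < dss.ddl"
```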

    Log in to TiDB and verify that the data includes the TPCH_001 database and the following tables:

      $ mysql -uroot -P4000 -h192.168.0.2
      MySQL [(none)]> show databases;
      +--------------------+
      | Database           |
      +--------------------+
      | INFORMATION_SCHEMA |
      | PERFORMANCE_SCHEMA |
      | TPCH_001           |
      | mysql              |
      | test               |
      +--------------------+
      5 rows in set (0.00 sec)
      MySQL [(none)]> use TPCH_001
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
      Database changed
      MySQL [TPCH_001]> show tables;
      +--------------------+
      | Tables_in_TPCH_001 |
      +--------------------+
      | CUSTOMER           |
      | LINEITEM           |
      | NATION             |
      | ORDERS             |
      | PART               |
      | PARTSUPP           |
      | REGION             |
      | SUPPLIER           |
      +--------------------+
      8 rows in set (0.00 sec)

    Usage Example

    Assume the IP of your PD node is 192.168.0.2 and its port is 2379. First go to the Spark deployment directory and launch spark-shell:

      $ cd spark
      $ bin/spark-shell

      scala> import org.apache.spark.sql.TiContext
      scala> val ti = new TiContext(spark)
      // Mapping all TiDB tables from `TPCH_001` database as Spark SQL tables
      scala> ti.tidbMapDatabase("TPCH_001")
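    With the Ansible deployment, the PD address is normally already wired into Spark's configuration. If you ever need to point Spark at PD manually, TiSpark reads the PD endpoints from a Spark property; the sketch below is an assumption about your setup (property name and file location may differ by TiSpark version), using the PD address from this guide:

```properties
# conf/spark-defaults.conf (sketch; assumes the PD address used in this guide)
spark.tispark.pd.addresses 192.168.0.2:2379
```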

    After that, you can call Spark SQL directly:

      scala> spark.sql("select count(*) from lineitem").show

    The result is:

      +--------+
      |count(1)|
      +--------+
      |   60175|
      +--------+

    Next, run a slightly more complex Spark SQL query:

      scala> spark.sql(
        """select
          |   l_returnflag,
          |   l_linestatus,
          |   sum(l_quantity) as sum_qty,
          |   sum(l_extendedprice) as sum_base_price,
          |   sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
          |   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
          |   avg(l_quantity) as avg_qty,
          |   avg(l_extendedprice) as avg_price,
          |   avg(l_discount) as avg_disc,
          |   count(*) as count_order
          |from
          |   lineitem
          |where
          |   l_shipdate <= date '1998-12-01' - interval '90' day
          |group by
          |   l_returnflag,
          |   l_linestatus
          |order by
          |   l_returnflag,
          |   l_linestatus
        """.stripMargin).show

    The result is:

      +------------+------------+---------+--------------+--------------+
      |l_returnflag|l_linestatus|  sum_qty|sum_base_price|sum_disc_price|
      +------------+------------+---------+--------------+--------------+
      |           A|           F|380456.00|  532348211.65|505822441.4861|
      |           N|           F|  8971.00|   12384801.37| 11798257.2080|
      |           N|           O|742802.00| 1041502841.45|989737518.6346|
      |           R|           F|381449.00|  534594445.35|507996454.4067|
      +------------+------------+---------+--------------+--------------+
      (continued)
      +-----------------+---------+------------+--------+-----------+
      |       sum_charge|  avg_qty|   avg_price|avg_disc|count_order|
      +-----------------+---------+------------+--------+-----------+
      | 526165934.000839|25.575155|35785.709307|0.050081|      14876|
      |  12282485.056933|25.778736|35588.509684|0.047759|        348|
      |1029418531.523350|25.454988|35691.129209|0.049931|      29181|
      | 528524219.358903|25.597168|35874.006533|0.049828|      14902|
      +-----------------+---------+------------+--------+-----------+
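    Note that the where clause in the query above compares against a constant: date '1998-12-01' - interval '90' day is just a fixed cutoff date computed once. A quick shell check (assuming GNU date is available) shows what the filter resolves to:

```shell
# Resolve the query's cutoff: 1998-12-01 minus 90 days (GNU date syntax).
CUTOFF=$(date -d "1998-12-01 -90 days" +%F)
echo "$CUTOFF"   # prints 1998-09-02
```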

    For more examples, see https://github.com/ilovesoup/tpch/tree/master/sparksql .