Hadoop Big Data Environment Setup v1.0



Building a Hadoop-Based Big Data Test Environment

Table of Contents

1. Operating System Setup and Network Topology
  1.1 Operating System Version
  1.2 Hardware Configuration
  1.3 System Accounts
  1.4 System Installation
    1.4.1 Insert the Installation Disc
    1.4.2 CentOS Setup
    1.4.3 Language Settings
    1.4.4 Storage Device Type
    1.4.5 Hostname and Network Configuration
    1.4.6 Time Zone
    1.4.7 Root Password
    1.4.8 Disk Partitioning
    1.4.9 Choose the Linux Installation Type
    1.4.10 Automatic Installation
    1.4.11 Reboot and Post-Install Configuration
2. Hadoop Platform Setup and Maintenance
  2.1 Building the Hadoop Cluster
  2.2 Hadoop Maintenance
  2.3 System Log Locations
  2.4 Common Problems and Remedies

1. Operating System Setup and Network Topology

The network topology is shown in the figure above: a Hadoop-based compute and storage cluster. It comprises four machines: one Master node, an IBM server that also doubles as a slave, plus three x86 PC servers as Slaves, giving four data (Slave) nodes in total.

1.1 Operating System Version

The platform runs the CentOS 6.4 operating system with the following kernel:

Linux master 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux

1.2 Hardware Configuration

Role              Model            Configuration
Master, Datanode  IBM3650          2620 x 2 CPUs; 64 GB RAM; 4 x 1 TB disks
Datanode          Dell PC server   2 TB disk; 12 GB RAM
Datanode          PC server        2 TB disk; 8 GB RAM
Datanode          PC server        2 TB disk; 12 GB RAM

1.3 System Accounts

Account   Password
root      Password~!@
hadoop    Hadoop

1.4 System Installation

Prerequisite: a CentOS 6.4 system installation disc.

1.4.1 Insert the Installation Disc

Insert the disc into the drive, power on the server, enter the server BIOS, and set the boot device to the DVD drive. After rebooting, the installer screen appears.

Accept the first (default) entry and press Enter. At the next prompt, press Tab to switch to Skip and press Enter.

Select Skip.

1.4.2 CentOS Setup

Click Next on the welcome screen.

1.4.3 Language Settings

Set the installation language to English and click Next; then choose U.S. English as the keyboard layout and click Next.

1.4.4 Storage Device Type

Choose the first option, Basic Storage Devices, and click Next. A warning dialog appears;

choose Yes.

1.4.5 Hostname and Network Configuration

The installer then shows the hostname and network configuration screen.

Enter the hostname, here nn.

Still on this screen, click the Configure Network button at the bottom to open the network configuration dialog shown below.

Click the Edit button to edit eth0.

Note the following points:

1. Check both Connect automatically (connect at boot) and Available to all users.

2. Only the IPv4 Settings tab needs changes; the other tabs can be left alone.

3. Set Method to Manual, then click the Add button to enter the address information.

4. Enter four values: Address 172.31.117.120, Netmask 255.255.255.0, Gateway 172.31.117.10, and DNS server 114.114.114.114.

On a virtual machine these four settings must additionally be on the same subnet as the host machine, with the same gateway and the same DNS.

Nothing else needs to be set; click Apply, then Close to leave the network configuration dialog.

With the hostname and network configured, click Next.

1.4.6 Time Zone

Select Asia/Shanghai as the time zone, then click Next.

1.4.7 Root Password

Set the root login password (root123). A weak-password warning appears after confirming; ignore it and choose Use Anyway.

1.4.8 Disk Partitioning

For the installation type, choose the bottom option, Create Custom Layout, and click Next to enter the disk partitioning screen.

Select the free space and click the Create button to add a partition.

Choose the first option, Standard Partition, and click Create. Create the swap partition first: set Size (MB) to 10 GB, i.e. 10240 MB, and under Additional Size Options choose Fixed Size. Leave everything else unchanged and click OK.

Then select the remaining free space and create another partition. The procedure is the same (standard creation), but the settings screen differs slightly: set Mount Point to the root "/", File System Type to ext4, and under Additional Size Options choose Fill to Maximum Allowable Size, i.e. use all remaining space. Click OK. The final layout is shown in the figure below.

Then click Next. A prompt dialog appears; choose Format to continue. A further prompt follows;

choose Write Changes to Disk to continue.

1.4.9 Choose the Linux Installation Type

Click Next again and a list of installation types appears. Desktop is the fullest installation and includes a graphical desktop; Minimal Desktop also includes a desktop but with fewer packages; Minimal installs no desktop at all. For a base operating system on a physical server, Desktop is the usual choice; in a virtualized environment, Minimal. The remaining options are not needed here and are not covered.

Confirm the choice and click Next.

1.4.10 Automatic Installation

When the installation finishes, the screen below appears.

There is nothing to choose; simply click Reboot.

1.4.11 Reboot and Post-Install Configuration

After the reboot, a Desktop installation requires a few first-boot settings: on Welcome choose Forward, on License choose Forward, on Information choose Forward, and on Create User choose Forward.

On Date and Time, check Synchronize date and time over the network, then Forward. On Kdump, uncheck Enable kdump to finish the setup.

Reboot once more; the CentOS installation is complete.

2. Hadoop Platform Setup and Maintenance

2.1 Building the Hadoop Cluster

a) Assign roles to the selected cluster nodes. The cluster has four machines: one Master node and four Slave roles, the Master also acting as a Slave.

b) Configure passwordless SSH login between all nodes.
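A sketch of step b), assuming the cluster user is hadoop and that the slave hostnames (slave1 to slave3 here, placeholders) resolve through /etc/hosts:

```shell
# Run on the master node as the hadoop user.
# "slave1 slave2 slave3" are placeholder hostnames for this sketch.
slaves="slave1 slave2 slave3"

# Generate a passphrase-less RSA key pair if none exists yet.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa -q

# Push the public key to every node, master included (commented out here,
# because each ssh-copy-id prompts once for that node's password):
#   for h in master $slaves; do ssh-copy-id "hadoop@$h"; done

# Afterwards "ssh slave1 hostname" should print the name with no prompt.
echo "public key ready: $HOME/.ssh/id_rsa.pub"
```

The start scripts only need the node they run on to reach the others, so distributing the Master's key to every node is sufficient here.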

c) Edit core-site.xml, hdfs-site.xml, mapred-site.xml, and related files to match the requirements of the system test environment.

core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
    <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
  </property>
</configuration>

hdfs-site.xml—详细内容

dfs.replication 2

dfs.namenode.name.dir /hadoop/dfs/data

dfs.datanode.data.dir /data

dfs.webhdfs.enabled true

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx256m</value>
  </property>
  <property>
    <name>mapred.jobtracker.plugins</name>
    <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
    <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/local</value>
  </property>
</configuration>

d) Format the file system: use the hadoop command in the HADOOP/bin directory to format HDFS.

e) Run the HADOOP/bin/start-dfs.sh command on the Master to start the NameNode process; the corresponding DataNode processes start on the Slave nodes.

f) Run the HADOOP/bin/start-mapred.sh command on the Master node to start the JobTracker process; the system starts TaskTracker processes on the nodes listed in conf/slaves.
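Steps d) through f) boil down to an ordered three-command sequence. The sketch below only prints that sequence; /opt/hadoop is assumed as the install directory (the path used in section 2.2), and the format step is strictly one-time:

```shell
# Ordered bring-up of HDFS and MapReduce, to be run on the Master node.
HADOOP=/opt/hadoop    # install directory (assumption, per section 2.2)

steps="$HADOOP/bin/hadoop namenode -format
$HADOOP/bin/start-dfs.sh
$HADOOP/bin/start-mapred.sh"

# Printed here rather than executed; on the real master run each line in order.
# Note: 'namenode -format' is a one-time step. Rerunning it on a live cluster
# destroys the existing HDFS namespace metadata.
printf '%s\n' "$steps"
```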

(2) Build the HBase cluster.

a) Unpack the downloaded HBase archive and edit the environment settings in conf/hbase-env.sh. The hbase.rootdir property must match the HDFS fs.default.name setting, and it designates /hbase as the HBase root directory.

b) Configure the regionservers file, which lists the HBase Regionserver nodes.

c) After configuration, copy the HBase directory to the same path on every node.

d) Start HBase: run the bin/start-hbase.sh command on the HMaster node to start the hbase cluster. The system first starts the zookeeper processes, then the master process, and finally the Regionserver processes.

2.2 Hadoop Maintenance

All Hadoop commands live in /opt/hadoop/bin.

hadoop namenode -format //format the file system
start-all.sh //start all Hadoop processes
stop-all.sh //stop all Hadoop processes

Hadoop status can be monitored from the following web addresses:

222.29.118.171:60010 //monitor all HBase status

222.29.118.171:50070 //view all HDFS status

2.3 System Log Locations

Log locations for all components of the Hadoop stack:

Tomcat service logs: /opt/bdp-web-1.0/tomcat/logs

Hadoop HDFS logs:
Namenode service logs: /opt/hadoop/logs
Datanode service logs: /opt/hadoop/logs
Journalnode service logs: /opt/hadoop/logs
ZKFC service logs: /opt/hadoop/logs
Zookeeper service logs: /opt/zookeeper/

Hadoop MR logs:
Jobtracker service logs: /opt/hadoop-mr1/logs
Tasktracker service logs: /opt/hadoop-mr1/logs

HBase logs:
Hmaster service logs: /opt/hbase/logs
Regionserver service logs: /opt/hbase/logs

2.4 Common Problems and Remedies

2.4.1 Primary Namenode Failure Alarm

Symptom:

The web page at the primary node's IP on port 50070 cannot be opened, e.g. http://10.1.20.23:50070.

Confirmation:

Log in to the node as the system user and run jps to list the cluster services; the service marked by the red box (in the screenshot) is missing.

Remedy:

Step 1: log in to the alerting node and check the alarm records in the log:

cd /opt/hadoop/logs //enter the log directory

tail -200f hadoop-hadoop-zkfc-TY101-M01.log //follow the latest 200 log lines

Step 2: restart the namenode service:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode //restart the namenode service

Check the log with tail -200f /opt/hadoop/logs/hadoop-hadoop-namenode-TY101-M01.log; if it shows no errors, the service has started normally.

Step 3: confirm the service is healthy by checking the log and the jps output.
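The jps check used in the confirmation and verification steps can be scripted. A minimal sketch: check_service is a hypothetical helper, and jps_out is inlined sample output standing in for a real capture from the node:

```shell
# check_service LISTING NAME -> success iff the jps listing contains NAME.
check_service() {
    echo "$1" | grep -qw "$2"
}

# Example jps output (inlined for illustration; on the node simply run: jps).
jps_out="2901 NameNode
3024 DFSZKFailoverController
3350 Jps"

if check_service "$jps_out" "NameNode"; then
    echo "NameNode running"
else
    echo "NameNode missing"
    # then restart it as in step 2:
    # /opt/hadoop/sbin/hadoop-daemon.sh start namenode
fi
```

The same helper applies to any of the daemons in section 2.4 by changing the service name.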

After a normal start, the log is written to hadoop-demo-namenode-master.log under /opt/hadoop/logs. A healthy startup log looks like this:

2013-09-29 10:27:42,892 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = master/192.168.3.10
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.0.0-cdh4.1.2
STARTUP_MSG:   classpath = /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/commons-io-2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/opt/hadoop/share/hadoop/common/lib/jets3t-0.6.1.jar:/opt/hadoop/share/hadoop/common/lib/kfs-0.3.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-annotations-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-el-1.0.jar:/opt/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-core-1.8.jar:/opt/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-auth-2.0.0-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/opt/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:/opt/hadoop/share/hadoop/common/lib/jetty-6.1.26.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-api-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/common/lib/jackson-core-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/common/lib/avro-1.7.1.cloudera.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jersey-server-1.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-xc-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.1.jar:/opt/hadoop/share/hadoop/common/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.4.0a.jar:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.3-cdh4.1.2.jar:/opt/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/common/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/common/lib/jasper-compiler-5.5.23.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.8.8.jar:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.8.jar:/opt/hadoop/share/had

2013-09-29 10:29:46,795 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/share/current/edits_0000000000000050373-0000000000000050374' to transaction ID 50373

The reported blocks 944 has reached the threshold 0.9990 of total blocks 944. Safe mode will be turned off automatically in 9 seconds.

2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 32 secs.

2013-09-29 10:28:15,892 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF.

2.4.2 Standby Namenode Failure Alarm

Symptom:

The web page at the standby node's IP on port 50070 cannot be opened, e.g. http://10.1.20.32:50070.

Confirmation:

Log in to the node as the system user and run jps to list the cluster services; the service marked by the red box is missing.

Remedy:

Step 1: log in to the alerting node and check the alarm records in the log:

cd /opt/hadoop/logs //enter the log directory

tail -200f hadoop-hadoop-zkfc-TY102-M02.log //follow the latest 200 log lines

Step 2: restart the namenode service:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode //restart the namenode service

Check the log with tail -200f /opt/hadoop/logs/hadoop-hadoop-namenode-TY102-M02.log; if it shows no errors, the service has started normally.

Step 3: confirm the service is healthy by checking the log and the jps output.

After a normal start, the log is written to hadoop-demo-namenode-master.log under /opt/hadoop/logs; a healthy startup log is the same as the one shown in section 2.4.1.

2.4.3 Zookeeper Failure Alarm

Symptom: syslog alarm. For example, a zookeeper failure on snode1 produces the following syslog entry:

[2013-10-20 10:00:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - Sent:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY101-001|+|Check Zookeeper service status|+|TY101-001:2181|+|dead|+|APP|+|HDQS|+|Zookeeper|+|1|+|Zookeeper service failure on TY101-001|+|1382234280|+|xiaoxu|+|13810466464

Confirmation: log in to the failed node (ssh snode1) and check the service status with jps; if the highlighted service is missing, the service has failed. The dead zookeeper service on snode1 can also be seen in the BDP console:

Home > Management Console > Cluster Monitoring > Cluster Service Monitoring

Remedy:

Step 1: log in to the failed node: ssh snode1

Inspect the zookeeper log directory and its log file.

Note: the zookeeper log is created in the working directory from which the zookeeper start command was run. For example, if /opt/zookeeper/bin/zkServer.sh start was run from /home/hadoop:

cd /home/hadoop //go to the log directory
tail -200f zookeeper.out //follow the latest 200 log records

Step 2: restart the zookeeper service:

/opt/zookeeper/bin/zkServer.sh start //start the zookeeper service

The startup log shows:

2013-09-29 11:19:20,620 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.10:42504
2013-09-29 11:19:20,621 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:20,621 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.10:42504 (no session established for client)
2013-09-29 11:19:20,787 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.14:33207
2013-09-29 11:19:20,788 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:20,788 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.14:33207 (no session established for client)
2013-09-29 11:19:20,860 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.12:50742
2013-09-29 11:19:20,860 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,124 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.12:50742 (no session established for client)
2013-09-29 11:19:21,124 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.15:47952
2013-09-29 11:19:21,124 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@202] - Accepted socket connection from /192.168.3.10:42505
2013-09-29 11:19:21,125 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,125 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.15:47952 (no session established for client)
2013-09-29 11:19:21,125 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2013-09-29 11:19:21,125 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1000] - Closed socket connection for client /192.168.3.10:42505 (no session established for client)

Starting zookeeper on a different node shows:

/lib/log4j-1.2.15.jar:/opt/zookeeper/bin/../lib/jline-0.9.94.jar:/opt/zookeeper/bin/../zookeeper-3.4.3-cdh4.1.2.jar:/opt/zookeeper/bin/../src/java/lib/*.jar:/opt/zookeeper/bin/../conf:
2013-09-29 11:21:03,055 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.library.path=/opt/java/jre/lib/amd64/server:/opt/java/jre/lib/amd64:/opt/java/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2013-09-29 11:21:03,056 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.io.tmpdir=/tmp
2013-09-29 11:21:03,056 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:java.compiler=
2013-09-29 11:21:03,056 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.name=Linux
2013-09-29 11:21:03,057 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.arch=amd64
2013-09-29 11:21:03,057 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:os.version=2.6.32-358.el6.x86_64
2013-09-29 11:21:03,057 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.name=demo
2013-09-29 11:21:03,057 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.home=/home/demo
2013-09-29 11:21:03,058 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Environment@100] - Server environment:user.dir=/home/demo
2013-09-29 11:21:03,059 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@162] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/zookeeper/version-2 snapdir /var/zookeeper/version-2
2013-09-29 11:21:03,059 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 233
2013-09-29 11:21:03,066 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@325] - Getting a snapshot from leader
2013-09-29 11:21:03,073 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@270] - Snapshotting: 0x200000001 to /var/zookeeper/version-2/snapshot.200000001
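Besides reading zookeeper.out, each node can be asked for its quorum role with zkServer.sh status, whose output ends in a "Mode:" line (leader, follower, or standalone). A sketch of extracting that line; zk_mode is a hypothetical helper, and the captured output is inlined here for illustration:

```shell
# zk_mode OUTPUT -> the value of the "Mode:" line from `zkServer.sh status`.
zk_mode() {
    echo "$1" | sed -n 's/^Mode: //p'
}

# On the node one would capture:
#   status_out=$(/opt/zookeeper/bin/zkServer.sh status 2>&1)
status_out="JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Mode: follower"

mode=$(zk_mode "$status_out")
echo "zookeeper role: $mode"   # one healthy node reports leader, the rest follower
```

A node whose status cannot be read at all should be restarted as in step 2 above.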

If the service has not recovered within 30 minutes of the steps above, initiate the cluster active/standby switchover from the contingency plan. For the detailed procedure, see the HDQS-AM-004 Historical Data Query System Emergency Handling Manual.

2.4.4 Datanode Failure

Symptom: syslog alarm with an uploaded alarm record. The following entry reports a dead datanode service on server TY103-006:

[2013-10-20 10:05:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - Sent:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY103-006|+|Check Datanode service status|+|TY103-006|+|dead|+|APP|+|HDQS|+|Datanode|+|1|+|Datanode service failure on TY103-006|+|1382234640|+|xiaoxu|+|13810466464

Confirmation: open the Hadoop web UI at http://10.1.242.182:50070 and click Dead Nodes to list the nodes that are down (marked in red). Then check the log on the server whose datanode service failed:

2013-10-20 09:42:56,741 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_2801755526513394545_957410
2013-10-20 09:44:01,341 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_8004390811280095539_189478
2013-10-20 09:45:05,940 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_-7558433326837136207_950828
2013-10-20 09:45:05,964 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_961832263082316047_963685
2013-10-20 09:45:06,141 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_3084063903778417422_751015
2013-10-20 09:45:06,159 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168.0.75-1376982711635:blk_-3145692262741932109_458312
2013-10-20 09:45:19,741 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1800471205-192.168

2013-09-29 11:40:09,693 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode datanode1/192.168.3.11:8020 using DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
2013-09-29 11:40:09,731 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Namenode Block pool BP-1224883789-192.168.3.10-1378444984820 (storage id DS-621562718-192.168.3.15-50010-1378445052913) service to datanode1/192.168.3.11:8020 trying to claim ACTIVE state with txid=50455
2013-09-29 11:40:09,731 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-1224883789-192.168.3.10-1378444984820 (storage id DS-621562718-192.168.3.15-50010-1378445052913) service to datanode1/192.168.3.11:8020
2013-09-29 11:40:09,770 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 707 blocks took 3 msec to generate and 36 msecs for RPC and NN processing
2013-09-29 11:40:09,771 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:null
2013-09-29 11:40:09,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 707 blocks took 2 msec to generate and 39 msecs for RPC and NN processing
2013-09-29 11:40:09,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@71b493c6
2013-09-29 11:40:09,774 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-1224883789-192.168.3.10-1378444984820.
2013-09-29 11:40:09,784 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-1224883789-192.168.3.10-1378444984820 to blockPoolScannerMap, new size=1
2013-09-29 11:40:09,002 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: bdpha

If the service has not recovered within 30 minutes of the steps above, initiate the cluster active/standby switchover from the contingency plan. For the detailed procedure, see the HDQS-AM-004 Historical Data Query System Emergency Handling Manual.

2.4.5 Journalnode Failure Alarm

Symptom: syslog alarm.

Confirmation: log in to the failed node and check the service status with jps; if the highlighted service is missing, the service has failed.

Remedy:

Step 1: log in to the failed node and inspect the journalnode log:

cd /opt/hadoop/logs //enter the log directory

tail -200f hadoop-hadoop-journalnode-TY101-M01.log //follow the latest 200 log records

Step 2: restart the journalnode service:

/opt/hadoop/sbin/hadoop-daemon.sh start journalnode //start the journalnode service
/opt/hadoop/sbin/hadoop-daemon.sh stop journalnode //stop the journalnode service

Then follow the startup log:

tail -200f hadoop-hadoop-journalnode-TY101-M01.log //follow the latest 200 log records

Step 3: verify the service status with jps.

2.4.6 Jobtracker Failure

Symptom: syslog alarm accompanied by a service interruption. Syslog alarm entry:

[2013-10-20 10:55:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - Sent:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY101-M01|+|Check JobTracker service status|+|TY101-M01|+|dead|+|APP|+|HDQS|+|JobTracker|+|1|+|JobTracker service failure on TY101-M01|+|1382237640|+|xiaoxu|+|13810466464

Confirmation: log in to the failed node and run jps; the highlighted service is missing. The BDP monitoring platform shows the same information under Home > Management Console > Cluster Monitoring > Cluster Service Monitoring.

Remedy:

Step 1: inspect the jobtracker service log:

cd /opt/hadoop-mr1/logs //enter the log directory

tail -200f hadoop-hadoop-jobtracker-TY101-M01.log //check the log

Step 2: start the jobtracker service:

/opt/hadoop-mr1/bin/hadoop-daemon.sh start jobtracker //start the jobtracker service

and follow the startup log:

tail -200f hadoop-hadoop-jobtracker-TY101-M01.log //check the log

Alternatively, restart the whole MR service: first stop it with /opt/hadoop-mr1/bin/stop-mapred.sh, then start it with /opt/hadoop-mr1/bin/start-mapred.sh, and finally check the log:

tail -200f /opt/hadoop-mr1/logs/hadoop-hadoop-jobtracker-TY101-M01.log

Step 3: if the log shows no errors, verify the service status with jps. The log shows:

2013-09-29 14:34:42,984 INFO org.apache.hadoop.mapred.JobTracker: Recovery done! Recoverd 0 of 0 jobs.

2013-09-29 14:34:42,984 INFO org.apache.hadoop.mapred.JobTracker: Recovery Duration (ms):1

2013-09-29 14:34:42,984 INFO org.apache.hadoop.mapred.JobTracker: Refreshing hosts information

2013-09-29 14:34:42,996 INFO org.apache.hadoop.util.HostsFileReader: Setting the includes file to

2013-09-29 14:34:42,996 INFO org.apache.hadoop.util.HostsFileReader: Setting the excludes file to

2013-09-29 14:34:42,996 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list

2013-09-29 14:34:42,996 INFO org.apache.hadoop.mapred.JobTracker: Decommissioning 0 nodes

2013-09-29 14:34:42,997 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-09-29 14:34:42,997 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting

2013-09-29 14:34:42,999 INFO org.apache.hadoop.mapred.JobTracker: Starting RUNNING
2013-09-29 14:34:43,013 WARN org.apache.hadoop.mapred.JobTracker: Serious problem, cannot find record of 'previous' heartbeat for 'tracker_datanode1:localhost/127.0.0.1:34514'; reinitializing the tasktracker

2013-09-29 14:34:43,017 WARN org.apache.hadoop.mapred.JobTracker: Serious problem, cannot find record of 'previous' heartbeat for 'tracker_datanode5:localhost/127.0.0.1:50273'; reinitializing the tasktracker

2013-09-29 14:34:43,017 WARN org.apache.hadoop.mapred.JobTracker: Serious problem, cannot find record of 'previous' heartbeat for 'tracker_datanode2:localhost/127.0.0.1:47020'; reinitializing the tasktracker

2013-09-29 14:34:43,017 WARN org.apache.hadoop.mapred.JobTracker: Serious problem, cannot find record of 'previous' heartbeat for 'tracker_datanode4:localhost/127.0.0.1:58744'; reinitializing the tasktracker

2013-09-29 14:34:43,145 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/datanode1

2013-09-29 14:34:43,147 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker tracker_datanode1:localhost/127.0.0.1:57061 to host datanode1

2013-09-29 14:34:43,150 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/datanode5

2013-09-29 14:34:43,150 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker tracker_datanode5:localhost/127.0.0.1:42437 to host datanode5

2013-09-29 14:34:43,158 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/datanode4

2013-09-29 14:34:43,158 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker tracker_datanode4:localhost/127.0.0.1:33865 to host datanode4

2013-09-29 14:34:43,171 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/datanode2

2013-09-29 14:34:43,171 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker tracker_datanode2:localhost/127.0.0.1:34548 to host datanode2

2.4.7 Tasktracker Failure

Symptom: syslog alarm.

Confirmation: log in to the failed node and run jps; the highlighted service is missing.

Remedy:

Step 1: inspect the tasktracker service log:

cd /opt/hadoop-mr1/logs //enter the log directory

tail -200f hadoop-hadoop-tasktracker-TY101-001.log //check the log

Step 2: start the tasktracker service:

/opt/hadoop-mr1/bin/hadoop-daemon.sh start tasktracker //start the tasktracker service

and follow the startup log:

tail -200f hadoop-hadoop-tasktracker-TY101-001.log //check the log

Step 3: if the log shows no errors, verify the service status with jps. A healthy tasktracker startup log looks like this:

2013-09-29 14:28:10,534 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

2013-09-29 14:28:10,550 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker with owner as demo

2013-09-29 14:28:10,550 WARN org.apache.hadoop.conf.Configuration: slave.host.name is deprecated. Instead, use dfs.datanode.hostname

2013-09-29 14:28:10,551 INFO org.apache.hadoop.mapred.TaskTracker: Good mapred local directories are: /hadoop/tmp/hadoop-demo/mapred/local

2013-09-29 14:28:10,581 WARN org.apache.hadoop.conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id

2013-09-29 14:28:10,581 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=TaskTracker, sessionId=

2013-09-29 14:28:10,628 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50273

2013-09-29 14:28:10,652 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-09-29 14:28:10,652 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50273: starting

2013-09-29 14:28:10,654 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: localhost/127.0.0.1:50273

2013-09-29 14:28:10,654 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_datanode5:localhost/127.0.0.1:50273

2013-09-29 14:28:10,672 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_datanode5:localhost/127.0.0.1:50273

2013-09-29 14:28:10,678 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2013-09-29 14:28:10,681 INFO org.apache.hadoop.mapred.TaskTracker: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4cc5aa00
2013-09-29 14:28:10,682 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's totalMemoryAllottedForTasks is -1 and reserved physical memory is not configured. TaskMemoryManager is disabled.

2013-09-29 14:28:10,683 INFO org.apache.hadoop.mapred.IndexCache: IndexCache created with max memory = 10485760

2013-09-29 14:28:10,690 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50060

2013-09-29 14:28:10,690 INFO org.mortbay.log: jetty-6.1.26.cloudera.2

2013-09-29 14:28:10,875 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50060

2013-09-29 14:28:10,875 INFO org.apache.hadoop.mapred.TaskTracker: FILE_CACHE_SIZE for mapOutputServlet set to : 2000
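The jps check in Step 3 can be scripted. The following is a minimal sketch of our own (the helper name `service_running` is invented, and the daemon path in the comment is taken from this document):

```shell
#!/bin/sh
# service_running NAME  -- reads `jps` output on stdin and succeeds if a
# JVM named NAME is listed. NAME is matched as a whole word, so
# "TaskTracker" will not accidentally match another process name.
service_running() {
  grep -qw "$1"
}

# Illustrative use on a data node (path assumed from this runbook):
#   jps | service_running TaskTracker || \
#       /opt/hadoop-mr1/bin/hadoop-daemon.sh start tasktracker
```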

2.4.8. Active HMaster service failure alert

Symptom: a syslog alert is raised. The syslog alert message is:

[2013-10-20 10:40:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - 发送:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY101-M01|+|检测HMaster服状态

|+|TY101-M01:60000|+|dead|+|APP|+|HDQS|+|HMaster|+|1|+|TY101-M01上HMaster服务故障|+|1382236562|+|xiaoxu|+|13810466464

The web UI at the host's port 60010 fails to open, for example http://10.1.20.58:60010.

How to confirm:

Log in to the node as the system user and run jps to list the cluster services; the service marked in the red box is missing.

The dead HMaster service can also be seen on the BDP platform:

Platform home > Management Console > Cluster Monitoring > Cluster Service Monitoring

Resolution:

Step 1: Log in to the faulty node and inspect the alert records in the log, as follows:

cd /opt/hbase/logs  // enter the log directory

tail -200f hbase-hadoop-master-TY101-M01.log  // follow the last 200 lines of the log

The log shows:

2013-10-20 10:33:15,466 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Creating scanner over .META. starting at key ''

2013-10-20 10:33:15,467 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Advancing internal scanner to startKey at ''

2013-10-20 10:33:15,604 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=7 regions=2262 average=323.14285 mostloaded=324 leastloaded=323

2013-10-20 10:33:16,347 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Finished with scanning at {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}

2013-10-20 10:33:16,347 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 2260 catalog row(s) and gc'd 0 unreferenced parent region(s)

Sun Oct 20 10:35:31 CST 2013 Killing master

Step 2: Restart the HMaster service with the following command:

/opt/hbase/sbin/hbase-daemon.sh start master  // restart the HMaster service

Check the log: tail -200f /opt/hbase/logs/hbase-hadoop-master-TY102-M02.log. If no exceptions appear, the service has started normally.

Step 3: Confirm the service is healthy by checking the log and by checking the process list with jps. The log shows:

2013-10-20 10:32:33,370 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Adding ZNode for /hbase/backup-masters/TY102-M02,60000,1382236353080 in backup master directory

2013-10-20 10:32:33,374 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the active master, TY101-M01,60000,1382110982913; waiting to become the next active master

If the above steps do not restore the service within 30 minutes, initiate the active/standby cluster switchover from the emergency plan. For the detailed procedure, see the HDQS-AM-004 Historical Data Query System Emergency Handling Manual.
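The "Killing master" line in the excerpt above is the key marker to look for before restarting. A small helper can flag it in the tail of a master log. This is our own sketch, not part of the runbook; the function name is invented and the log path in the comment comes from this document:

```shell
#!/bin/sh
# master_was_killed LOGFILE  -- succeeds if the tail of an HMaster log
# contains the "Killing master" marker that hbase-daemon.sh writes when
# the daemon is stopped or killed.
master_was_killed() {
  tail -n 200 "$1" | grep -q "Killing master"
}

# Illustrative use (path from this document):
#   master_was_killed /opt/hbase/logs/hbase-hadoop-master-TY101-M01.log \
#       && /opt/hbase/sbin/hbase-daemon.sh start master
```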

2.4.9. Standby HMaster service failure alert

Symptom: a syslog alert is raised. The syslog alert message is:

[2013-10-20 10:40:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - 发送:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY102-M02|+|检测HMaster服状态

|+|TY102-M02:60000|+|dead|+|APP|+|HDQS|+|HMaster|+|1|+|TY101-M01上HMaster服务故障|+|1382236562|+|xiaoxu|+|13810466464

The web UI at the host's port 60010 fails to open, for example http://10.1.20.59:60010.

How to confirm:

Log in to the node as the system user and run jps to list the cluster services; the service marked in the red box is missing.

The dead HMaster service can also be seen on the BDP platform:

Platform home > Management Console > Cluster Monitoring > Cluster Service Monitoring

Resolution:

Because HMaster runs as an active/standby pair, a failure of the standby HMaster node does not affect the HBase service as a whole. Check the error log as soon as possible and simply restart the service.

Step 1: Log in to the faulty node and inspect the alert records in the log, as follows:

cd /opt/hbase/logs  // enter the log directory

tail -200f hbase-hadoop-master-TY102-M02.log  // follow the last 200 lines of the log

The log shows:

2013-10-20 10:33:15,466 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Creating scanner over .META. starting at key ''

2013-10-20 10:33:15,467 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Advancing internal scanner to startKey at ''

2013-10-20 10:33:15,604 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=7 regions=2262 average=323.14285 mostloaded=324 leastloaded=323

2013-10-20 10:33:16,347 DEBUG org.apache.hadoop.hbase.client.HTable$ClientScanner: Finished with scanning at {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}

2013-10-20 10:33:16,347 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 2260 catalog row(s) and gc'd 0 unreferenced parent region(s)

Sun Oct 20 10:35:31 CST 2013 Killing master

Step 2: Restart the HMaster service with the following command:

/opt/hbase/sbin/hbase-daemon.sh start master  // restart the HMaster service

Check the log: tail -200f /opt/hbase/logs/hbase-hadoop-master-TY102-M02.log. If no exceptions appear, the service has started normally.

Step 3: Confirm the service is healthy by checking the log and by checking the process list with jps. The log shows:

2013-10-20 10:32:33,370 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Adding ZNode for /hbase/backup-masters/TY102-M02,60000,1382236353080 in backup master directory

2013-10-20 10:32:33,374 INFO org.apache.hadoop.hbase.master.ActiveMasterManager: Another master is the active master, TY101-M01,60000,1382110982913; waiting to become the next active master

2.4.10. RegionServer service failure alert

Symptom: a syslog alert is raised.

Syslog alert message (for example, when the RegionServer service on snode7 goes down):

[2013-10-20 10:50:02 INFO ] [com.cms.web.syslog.SyslogUtil:100] - 发送:

CEB-HDQS|+|CEB-HDQS|+|1001|+|1001|+|NA|+|TY101-007|+|检测RegionServer服务状态|+|TY101-007:60020|+|dead|+|APP|+|HDQS|+|RegionServer|+|1|+|TY101-007上RegionServer服务故障|+|1382237162|+|xiaoxu|+|13810466464

Log in to the node as the system user and run jps to list the cluster services; the service marked in the red box is missing.

Use the BDP monitoring platform to view the RegionServer fault on snode7:

Platform home > Management Console > Cluster Monitoring > Cluster Service Monitoring

Resolution:

Step 1: Log in to the alerting node and inspect the alert records in the log, as follows:

cd /opt/hbase/logs  // enter the log directory

tail -200f hbase-hadoop-regionserver-TY101-M01.log  // follow the last 200 lines of the log

Step 2: Restart the RegionServer service with the following command:

/opt/hbase/sbin/hbase-daemon.sh start regionserver  // restart the RegionServer service

Check the log: tail -200f /opt/hbase/logs/hbase-hadoop-regionserver-TY101-001.log. If no exceptions appear, the service has started normally.

Step 3: Confirm the service is healthy by checking the log and by checking the process list with jps. The log shows:

2013-09-29 14:30:13,226 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server Responder: starting

2013-09-29 14:30:13,226 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: starting

2013-09-29 14:30:13,234 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: starting

2013-09-29 14:30:13,234 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: starting

2013-09-29 14:30:13,234 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: starting

2013-09-29 14:30:13,234 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: starting

2013-09-29 14:30:13,235 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60020: starting

2013-09-29 14:30:13,235 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020: starting

2013-09-29 14:30:13,235 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60020: starting

2013-09-29 14:30:13,235 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60020: starting

2013-09-29 14:30:13,236 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020: starting

2013-09-29 14:30:13,236 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020: starting

2013-09-29 14:30:13,236 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: starting

2013-09-29 14:30:13,236 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 1 on 60020: starting

2013-09-29 14:30:13,237 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 2 on 60020: starting

2013-09-29 14:30:13,237 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: starting

2013-09-29 14:30:13,237 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: starting

2013-09-29 14:30:13,238 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: starting

2013-09-29 14:30:13,239 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 6 on 60020: starting

2013-09-29 14:30:13,239 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: starting

2013-09-29 14:30:13,239 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: starting

2013-09-29 14:30:13,239 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: starting

2013-09-29 14:30:13,239 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 0 on 60020: starting

2013-09-29 14:30:13,240 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60020: starting

2013-09-29 14:30:13,240 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60020: starting

2013-09-29 14:30:13,244 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as datanode5,60020,1380436211922, RPC listening on datanode5/192.168.3.15:60020, sessionid=0x4167bb09210000

2013-09-29 14:30:13,244 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker datanode5,60020,1380436211922 starting

2013-09-29 14:30:13,245 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer MXBean

If the above steps do not restore the service within 30 minutes, initiate the active/standby cluster switchover from the emergency plan. For the detailed procedure, see the HDQS-AM-004 Historical Data Query System Emergency Handling Manual.
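A healthy RegionServer start can also be recognized mechanically from an excerpt like the one above: the IPC handler threads log "starting" and the server finally logs "Serving as". The helpers below are our own sketch (the function names are invented; port 60020 is the RegionServer RPC port used in this deployment):

```shell
#!/bin/sh
# rs_started LOGFILE  -- succeeds if a regionserver log contains the
# "HRegionServer: Serving as <host>,60020,..." line that marks a
# completed startup.
rs_started() {
  grep -q "HRegionServer: Serving as" "$1"
}

# handler_count LOGFILE  -- number of IPC handler threads (plain, PRI
# and REPL alike) that logged "starting" on port 60020.
handler_count() {
  grep -c "IPC Server handler .* on 60020: starting" "$1"
}
```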

2.4.11. Disk failure alert: management node

The management nodes' disks are configured as RAID 5, which tolerates the failure of a single disk without any data loss.

A management-node disk failure does not raise a syslog alert; it can only be detected by physically inspecting the server's disk status LEDs.

Resolution:

Step 1: Locate the failed disk.

Step 2: Log in via the HP server's built-in iLO remote management, identify the failed disk, replace it directly, and rebuild the RAID group. The procedure is complete once the rebuild finishes.

2.4.12、磁盘故障报警-数据节点

On the data nodes, the system disks are configured as RAID 1 and the data disks as RAID 0. When a system disk fails, proceed as follows:

Step 1: Locate the failed disk.

Step 2: Log in via the HP server's built-in iLO remote management, identify the failed disk, replace it directly, and rebuild the RAID group. The procedure is complete once the rebuild finishes.

When a data disk fails:

Symptom: the server's disk status LED turns red, or the server's system log reports I/O errors.

How to confirm: log in to the system and check the system log:

cd /var/log  // enter the log directory

tail -200f messages  // follow the last 200 lines of the log

Alternatively, check with dmesg.
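Scanning /var/log/messages or dmesg output for disk errors can be sketched as a small filter. This helper is our own illustration; the pattern list is an assumption and not exhaustive, since the exact kernel messages vary by driver and kernel version:

```shell
#!/bin/sh
# io_errors  -- reads syslog/dmesg text on stdin and prints lines that
# look like disk I/O errors. Patterns are illustrative, not exhaustive.
io_errors() {
  grep -iE 'I/O error|medium error|ata[0-9]+.*(failed|error)'
}

# Illustrative use:
#   tail -200 /var/log/messages | io_errors
#   dmesg | io_errors
```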

Resolution:

Step 1: Log in to the server and first stop the services it is running.

When a data disk fails on the node4 machines at the Shangdi or Jiuxianqiao sites:

These nodes are DataNodes running the datanode, tasktracker, regionserver, and journalnode services. Stop the services one by one:

/opt/hbase/bin/hbase-daemon.sh stop regionserver

/opt/hadoop-mr1/bin/hadoop-daemon.sh stop tasktracker

/opt/hadoop/sbin/hadoop-daemon.sh stop datanode

/opt/hadoop/sbin/hadoop-daemon.sh stop journalnode

When a data disk fails on snode1 at Shangdi or snode3 at Jiuxianqiao, stop the services one by one:

/opt/hbase/bin/hbase-daemon.sh stop regionserver

/opt/hadoop-mr1/bin/hadoop-daemon.sh stop tasktracker

/opt/hadoop/sbin/hadoop-daemon.sh stop datanode

/opt/zookeeper/bin/zkServer.sh stop

Step 2: Pull the failed disk, insert the replacement, reboot the server, enter the HP array controller interface, and create the newly inserted disk as a RAID 0 volume; then boot the server system normally.

Step 3: Create a new partition on the new disk, for example fdisk /dev/sde, following the prompts.

Format the new disk: mkfs.ext3 /dev/sde1

Mount the new disk at its mount point: mount /dev/sde1 /data3

As the root user, change the directory ownership: chown -R hadoop:hadoop /data3

Step 4: Restart the Hadoop service components:

/opt/hadoop/sbin/hadoop-daemon.sh start datanode

/opt/hadoop-mr1/bin/hadoop-daemon.sh start tasktracker

/opt/hbase/bin/hbase-daemon.sh start regionserver

Step 5: Check the service status; jps is sufficient to query it.
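Steps 3 and 4 above can be captured as a dry-run helper that prints the command sequence for a given device and mount point instead of executing it. The device, mount point, filesystem, and ownership follow the text; the helper itself is our own sketch:

```shell
#!/bin/sh
# disk_replace_plan DEV MNT  -- prints, without running, the commands
# from steps 3-4 for bringing a replaced data disk back into service.
disk_replace_plan() {
  dev=$1
  part=${dev}1          # first primary partition, e.g. /dev/sde1
  mnt=$2
  echo "fdisk $dev                  # create one primary partition"
  echo "mkfs.ext3 $part             # format the new partition"
  echo "mount $part $mnt            # attach it at the mount point"
  echo "chown -R hadoop:hadoop $mnt # hand it to the hadoop user"
}

# Example: disk_replace_plan /dev/sde /data3
```

Printing the plan first and pasting commands one at a time is a deliberate safety choice: fdisk and mkfs are destructive if pointed at the wrong device.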

2.4.13. Data node failure maintenance

Symptom: a syslog alert is raised; at the same time, opening http://10.1.242.182:50070 shows the total storage reduced by one node's capacity, and the dead machine is listed under Dead Nodes.

How to confirm: a node failure must be investigated manually to determine whether it is caused by the network, by the machine going down, or by the service being down. If only the service is down, follow the DataNode service troubleshooting procedure directly. Network problems and machine crashes require manual investigation.

Resolution:

Example: an alert caused by a network fault.

Step 1: Determine the cause of the network fault: a cable problem or a NIC port problem.

Step 2: Once the network is restored, the node rejoins the cluster automatically; no further action is needed.

Example: a hardware fault prevents the services from starting.

Step 1: Repair the faulty server.

Step 2: Reboot the machine and start the Hadoop services one by one:

/opt/hadoop/sbin/hadoop-daemon.sh start datanode
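Whether the node has rejoined can be read off the NameNode's report. The parser below is a hedged sketch of our own: the "Datanodes available: X (Y total, Z dead)" summary line matches the Hadoop 1.x-era `hadoop dfsadmin -report` output, but the exact wording varies by Hadoop version, so verify the format on your cluster first:

```shell
#!/bin/sh
# dead_count  -- reads `dfsadmin -report` text on stdin and prints the
# dead-datanode count from the "Datanodes available: X (Y total, Z dead)"
# summary line. The line format is an assumption; it varies by version.
dead_count() {
  sed -n 's/.*(\([0-9][0-9]*\) total, \([0-9][0-9]*\) dead).*/\2/p'
}

# Illustrative use (binary path assumed from this runbook):
#   /opt/hadoop/bin/hadoop dfsadmin -report | dead_count
```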
