This article introduces two ways to import data into HBase: one goes through Hive, and the other generates HFiles directly from flat files.
1. Importing data with hive-hbase-handler
This method requires a supporting jar package.
Download link: https://down.51cto.com/data/2464129
Place the jar in $HBASE_HOME/lib, replacing the original jar package.
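For example, assuming the downloaded jar is named hive-hbase-handler.jar and sits in the current directory (the exact name and paths will vary with your versions), the copy is just:
# copy the downloaded handler jar into HBase's lib directory
$ cp hive-hbase-handler.jar $HBASE_HOME/lib/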
Next, modify hive-site.xml and add the following properties:
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///applications/hive-2.3.2/lib/hive-hbase-handler.jar,file:///applications/hive-2.3.2/lib/guava-14.0.1.jar,file:///applications/hbase-2.0.5/lib/hbase-common-2.0.5.jar,file:///applications/hbase-2.0.5/lib/hbase-client-2.0.5.jar,file:///applications/hive-2.3.2/lib/zookeeper-3.4.6.jar</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
Import Hive data into HBase:
① Create hive table:
create table hive_hbase_test(id int,name string,age int);
② Insert data into the Hive table
insert into hive_hbase_test(id,name,age) values(1,"xiaozhang",18);
insert into hive_hbase_test(id,name,age) values(2,"xiaowang",19);
Inserting rows like this is fine in a test environment; in a production environment it is best to load the data through an external table.
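As a minimal sketch of the external-table route (the HDFS path /tmp/hive_hbase_src and the comma-separated layout are assumptions for illustration, not part of the original setup):
create external table hive_hbase_test_ext(id int, name string, age int)
row format delimited fields terminated by ','
location '/tmp/hive_hbase_src';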
③ Create the HBase-mapped table
create table hive_hbase_pro(row_key string,id bigint,name string,age int) STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler" WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:id,info:name,info:age") TBLPROPERTIES ("hbase.table.name"="hive_hbase_pro");
At this point, a table named hive_hbase_pro will be created in HBase.
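You can confirm this from the HBase shell, for example:
hbase> list
hbase> describe 'hive_hbase_pro'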
④ Insert data into the mapped HBase table
#Configure the following parameters in hive:
set hive.hbase.wal.enabled=false;
set hive.hbase.bulk=true;
set hbase.client.scanner.caching=1000000;
⑤ Import the data:
insert overwrite table hive_hbase_pro select id as row_key,id,name,age from hive_hbase_test;
At this point the Hive table (and, through the mapping, the HBase table) contains the data:
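To spot-check that the rows also reached HBase, a quick scan from the HBase shell works, e.g.:
hbase> scan 'hive_hbase_pro', {LIMIT => 2}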
Supplement: if the table already exists in HBase, you can only create an external table in Hive to map it:
create external table hive_hbase_xiaoxu(row_key string,id bigint,name string,age int) STORED BY "org.apache.hadoop.hive.hbase.HBaseStorageHandler" WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:id,info:name,info:age") TBLPROPERTIES ("hbase.table.name"="hive_hbase_pro");
The external table created this way can read the data already in the HBase table.
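Once the external table exists, an ordinary Hive query reads straight through to HBase, for example:
select row_key, name, age from hive_hbase_xiaoxu limit 10;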
Summary: data inserted this way is written row by row, so it is relatively slow. It is usable when the volume is in the millions to tens of millions of rows and the machines are reasonably well provisioned; throughput is roughly 20,000 to 30,000 rows per second.
There are also import paths based on Phoenix and Pig, which work much like the Hive approach, so they are not covered here.
2. Importing data with BulkLoad
This way of importing data is quite fast, because it skips the WAL and directly produces the underlying HFile file.
Advantages:
- BulkLoad does not write to the WAL, and it does not trigger flushes or splits.
- Inserting data through a large number of Put calls can cause heavy GC activity; if the HBase table is not pre-split, it can also create a hotspot on a single node and, in severe cases, even affect the stability of the HBase nodes. BulkLoad has none of these concerns.
- The process does not involve a large number of performance-consuming interface calls.
Steps:
① Upload the data file to HDFS:
Download address: https://down.51cto.com/data/2464129
The contents of this file are comma-separated.
$ hadoop fs -put sp_address.txt /tmp/sp_addr_bulktable
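For reference, each line of sp_address.txt should match the column mapping used in the next step (row key, then ID, PLACE_TYPE, PLACE_CODE, PLACE_NAME, UP_PLACE_CODE); the values below are made up purely for illustration:
1,1,1,110000,beijing,0
2,2,1,120000,tianjin,0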
② Use the ImportTsv command to generate HFile files
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=/tmpbulkdata/sp_addr_data -Dimporttsv.columns=HBASE_ROW_KEY,sp_address:ID,sp_address:PLACE_TYPE,sp_address:PLACE_CODE,sp_address:PLACE_NAME,sp_address:UP_PLACE_CODE sp_address_bulkload "/tmp/sp_addr_bulktable"
Parameter introduction:
-Dimporttsv.separator: specifies the file separator
-Dimporttsv.bulk.output: the output directory for the generated HFiles (this directory must not already exist)
-Dimporttsv.columns: the column mapping for the HBase table
sp_address_bulkload: the HBase table name (the HBase table must be created before the HFiles are generated)
"/tmp/sp_addr_bulktable": the source data directory
**Table creation statement:** create 'sp_address_bulkload', 'sp_address'
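Since the advantages above note that a table without pre-splits can develop hotspots, you may also want to pre-split the table here; one hedged variant (the split points are purely illustrative) is:
create 'sp_address_bulkload', 'sp_address', SPLITS => ['2000', '4000', '6000', '8000']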
③ Import the HFiles into HBase
$ hadoop jar /applications/hbase-2.0.5/lib/hbase-mapreduce-2.0.5.jar completebulkload /tmpbulkdata/sp_addr_data/ sp_address_bulkload
There is a pitfall here: posts online say the completebulkload main class lives in hbase-server-VERSION-hadoop2.jar, but with the 2.0.5 version used here it is actually in hbase-mapreduce-2.0.5.jar.
Benefit: this command is essentially an HDFS move (mv) operation, and no MapReduce job is started.
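On HBase 2.x you can usually also invoke the loader class directly instead of going through hadoop jar; the exact class name differs between minor versions, so treat this as a sketch:
$ hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles /tmpbulkdata/sp_addr_data sp_address_bulkload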
④ View the HBase table
hbase> scan 'sp_address_bulkload'
At this point, the data has been loaded into HBase.
Of course, you can also go through the Java API instead, but the learning cost doubles; for scenarios that are not particularly complex, the shell is usually enough.
Summary: this method is the fastest, since it writes data in bulk instead of row by row, and it is the recommended approach. In a real environment it is very fast: in our test, more than 400 million rows were loaded in about 20 minutes. For comparison, on a machine with 256 GB of memory, importing 50 million rows with Sqoop took 27 minutes.