SQLite import of a tab file: does .import run many individual inserts, or does it group them with transactions?

I am importing millions of rows from a tab file, and SQLite's .import with .mode tabs is very slow. I have three indexes, so maybe the indexing is what's slow. But first I want to check whether .import groups batches / all of these lines into a single commit. I couldn't find documentation on how .import works; does anyone know? If indexing is the problem (I have run into this before with MySQL), how can I disable it during the .import and re-index at the end?

[Update 1]

Following @sixfeetsix's comments.

My schema is:

CREATE TABLE ensembl_vf_b36 (
variation_name varchar(20),
chr varchar(4),
start integer,
end integer,
strand varchar(5),
allele_string varchar(3),
map_weight varchar(2),
flags varchar(50),
validation_status varchar(100),
consequence_type varchar(50)
);
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);

Data:

rs35701516 NT_113875 352 352 1 G/A 2 NULL NULL INTERGENIC
rs12090193 NT_113875 566 566 1 G/A 2 NULL NULL INTERGENIC
rs35448845 NT_113875 758 758 1 A/C 2 NULL NULL INTERGENIC
rs17274850 NT_113875 1758 1758 1 G/A 2 genotyped cluster,freq INTERGENIC

There are 15,608,032 entries in this table.

And these are the timings:

$ time sqlite3 -separator '	' test_import.db '.import variations_build_36_ens-54.tab ensembl_vf_b36'

real 29m27.643s
user 4m14.176s
sys 0m15.204s

[Update 2]

@sixfeetsix has a good answer; if you are reading this, you will also be interested in these questions:

Faster bulk inserts in sqlite3?

Sqlite3: Disabling primary key index while inserting?

[Update 3] Solution: 30 minutes -> 4 minutes

Even with all the optimizations (see the accepted answer), the import still takes nearly 30 minutes. But if no index is in place during the import and it is added at the end, the total time is 4 minutes:

-- importing without indexes:
real 2m22.274s
user 1m38.836s
sys 0m4.850s

-- adding the indexes afterwards
$ time sqlite3 ensembl-test-b36.db
real 2m18.344s
user 1m26.264s
sys 0m6.422s
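Put together, the two-step approach looks roughly like this (just a sketch; the script names import_noidx.sql and add_index.sql are placeholders, not the exact files I used):

-- import_noidx.sql: load the data with no index in place
-- (the table already exists, created with the schema shown above)
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36

-- add_index.sql: build the index once all rows are loaded
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);

and then:

$ time sqlite3 ensembl-test-b36.db < import_noidx.sql
$ time sqlite3 ensembl-test-b36.db < add_index.sql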

[Accepted answer, by @sixfeetsix]

I believe building the index gets slower and slower as more and more records are added. Depending on the RAM you have, you can tell sqlite to use enough memory so that all this index-building activity is done in memory (i.e. without all the I/O that would otherwise happen with less memory).

For 15M records, I'd say you should set the cache size to 500000 (cache_size is measured in pages, not bytes).

You can also tell sqlite to keep its transaction journal in memory.

Finally, you can set synchronous to OFF so that sqlite never waits for writes to be committed to disk.

Using this, I was able to divide by 5 the time needed to import 15M records (from 14 minutes down to about 2.5), with each record being a random GUID split into 5 columns and the three middle columns indexed:


b40c1c2f 912c 46c7 b7a0 3a7d8da724c1
9c1cdf2e e2bc 4c60 b29d e0a390abfd26
b9691a9b b0db 4f33 a066 43cb4f7cf873

So, to try this, I suggest you put all the instructions in a file, say import_test:

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36

Then try it out:

time sqlite3 test_import.db < import_test

Edit

This is in answer to Pablo's (the OP's) comment (too long to post as a comment):
My (educated) guesses are:

1. Because .import itself is not SQL, it doesn't involve many transactions; I'd even be tempted to think it is written to go faster than if you had done it all in one "normal" transaction; and,

2. If you have enough memory to allocate, and you set up your environment as I suggest, the real (time) hog here is reading the flat file and then writing the final content of the database, because what happens in between is extremely fast; i.e. fast enough that there isn't much time to be gained by optimizing it, when you compare that potential gain with the (probably) incompressible time spent on disk I/O.

If I'm wrong, though, I'd be happy to hear why, for my own benefit.

Edit 2

I ran a comparison test between having the index in place during the .import and adding it right after the .import finished. I used the same technique, generating 15M records made of split random UUIDs:

# Python 2: write 15M rows of random UUIDs split into 5 tab-separated columns
import csv, uuid
w = csv.writer(open('bla.tab', 'wb'), dialect='excel-tab')
for i in xrange(15000000):
    w.writerow(str(uuid.uuid4()).split('-'))

Then I tested importing with the index created before vs. after the import (here the index is created before):

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
CREATE INDEX test_idx on test (f2, f3, f4);
.mode tabs
.import bla.tab test
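
The import-then-index variant is essentially the same script with the index creation moved to after the .import, roughly like this (my reconstruction, not necessarily the exact file used):

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
.mode tabs
.import bla.tab test
-- index built only once all 15M rows have been imported
CREATE INDEX test_idx on test (f2, f3, f4);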

So here are the timings with the index added before:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory

real 2m58.839s
user 2m21.411s
sys 0m6.086s

and with the index added after:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory

real 2m19.261s
user 2m12.531s
sys 0m4.403s

See how the "user" time difference (~9s) doesn't account for the full time difference (~40s)? To me this means that some extra I/O happens when the index is created before the import, so I was wrong to think it was all being done in memory with no extra I/O.

Conclusion: create the index after the import and you'll get even better import times (as Donal mentioned).

