python - Disk read slows down INSERT in MySQL


I am trying to optimize the speed of large INSERT queries on MariaDB (10.0.31) into an InnoDB table.

Here is the structure of the table (131 million rows):

field           type          null  key  default
id_num          bigint(45)    no    pri  null
content         varchar(250)  yes        null
user_id         bigint(24)    no    mul  null
location        varchar(70)   yes        null
date_create     datetime      no    mul  null
retweet_count   int(7)        no         null
isretweet       tinyint(1)    no         null
hasreetweet     tinyint(1)    no         null
original        bigint(45)    yes        null
url             varchar(150)  yes        null
favorite_count  int(7)        no         null
selected        int(11)       no         0
sentiment       int(11)       no         0

Here is the output of SHOW CREATE TABLE:

CREATE TABLE `twit` (
  `id_num` bigint(45) NOT NULL,
  `content` varchar(250) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `user_id` bigint(24) NOT NULL,
  `location` varchar(70) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date_create` datetime NOT NULL,
  `retweet_count` int(7) NOT NULL,
  `isretweet` tinyint(1) NOT NULL,
  `hasreetweet` tinyint(1) NOT NULL,
  `original` bigint(45) DEFAULT NULL,
  `url` varchar(150) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `favorite_count` int(7) NOT NULL,
  `selected` int(11) NOT NULL DEFAULT '0',
  `sentiment` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id_num`),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here is the structure of the indexes:

table  non_unique  key_name  seq_in_index  column_name  collation  cardinality  sub_part  packed  null  index_type  comment  index_comment
twit   0           PRIMARY   1             id_num                  124139401    NULL      NULL          BTREE
twit   1           user_id   1             user_id                 535083       NULL      NULL          BTREE

Here is the output of SHOW ENGINE INNODB STATUS (buffer pool and memory section):

BUFFER POOL AND MEMORY
----------------------
Total memory allocated 8942256128; in additional pool allocated 0
Total memory allocated by read views 184
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 141954688 (141606424 + 348264)
    Page hash           4426024 (buffer pool 0 only)
    Dictionary cache    35656039 (35403184 + 252855)
    File system         845872 (812272 + 33600)
    Lock system         21251648 (21250568 + 1080)
    Recovery system     0 (0 + 0)
Dictionary memory allocated 252855
Buffer pool size        524286
Buffer pool size, bytes 8589901824
Free buffers            448720
Database pages          75545
Old database pages      27926
Modified db pages       0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 74639, created 906, written 39133
0.12 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 999 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 75545, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

I use the following Python code to download data from a 3rd-party source and fill the table with it:

add_twit = (" insert table (id_num, content,....) values (%s, %s, ....)") testtime=0 t0 = time.time() data_twit = []  #### data retrieving  #### page in limit_handled(...):     status in page:         data_twit.append(processed_tweet) ####   ##### mysql insert  tt0 = time.time() cursorsql.executemany(add_twit, data_twit) testtime += time.time() - tt0 ####  cnx.commit() print('total_time ' + str(time.time()-t0)) print('sqlexecute_time ' + str(testtime)) 

What the code does:

It gets tweets from a 3rd-party provider, 16 pages of them with 200 tweets (statuses) per page, so a total of 3200 rows is added to the table at every iteration (one user). I tried inserting 1 query per tweet (with cursorsql.execute(add_twit, data_twit)), and 16 queries of 200 tweets each, but the fastest by a few seconds is making 1 query of 3200 tweets using the optimized cursorsql.executemany function.

For 3200 tweets, it takes about 10 seconds to download them and 75 seconds to write them to the database, which seems like a lot considering that 1 tweet (row) takes about 0.2 KB in the table, so 3200 of them amount to roughly 640 KB. It shouldn't take 75 seconds...
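For reference, here is a minimal sketch of how the three batching strategies could be timed against each other. It assumes the mysql-connector-python driver, placeholder connection parameters and dummy rows, and a hypothetical scratch copy of the table (twit_bench) so the real data is untouched:

import time
import mysql.connector  # assumption: the script uses mysql-connector-python

cnx = mysql.connector.connect(user="user", password="pw", database="db")  # placeholder credentials
cursor = cnx.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS twit_bench LIKE twit")  # scratch copy, for timing only

# The NOT NULL columns without defaults, plus content, to keep the dummy rows short.
INSERT_SQL = ("INSERT INTO twit_bench (id_num, content, user_id, date_create, retweet_count, "
              "isretweet, hasreetweet, favorite_count) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

def make_batch(offset, n=3200):
    # Dummy rows standing in for the processed tweets.
    return [(offset + i, "dummy text", i % 1000, "2017-09-01 00:00:00", 0, 0, 0, 0)
            for i in range(n)]

def bench(label, fn):
    t0 = time.time()
    fn()
    cnx.commit()
    print("%-14s %.2f s" % (label, time.time() - t0))

batch = make_batch(0)
bench("3200 x 1 row", lambda: [cursor.execute(INSERT_SQL, row) for row in batch])

batch = make_batch(10000)
bench("16 x 200 rows", lambda: [cursor.executemany(INSERT_SQL, batch[i:i + 200])
                                for i in range(0, len(batch), 200)])

batch = make_batch(20000)
bench("1 x 3200 rows", lambda: cursor.executemany(INSERT_SQL, batch))

Each strategy uses a different id_num offset so the primary key does not collide between runs.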

What happens when monitoring disk usage with iotop:

  • During the data-retrieving part of the code (after the first iteration):
    • read = 0.00 B/s
    • write = 6.50 MB/s

    The disk keeps writing at that ~6 MB/s rate for several minutes after a large insert.

  • During the SQL-insert part of the code:

    • read = 1.5 MB/s
    • write = 300 KB/s

    It looks like the disk reading (for index purposes, I guess?) makes the writing rate fall (see the measurement sketch below).
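Since iotop only shows aggregate rates, one rough way to separate reads from writes during the insert itself is to sample /proc/diskstats around the executemany call from the script above (Linux only; the device name "sda" is an assumption, and the counters cover the whole device, not just MySQL):

import time

def disk_bytes(device="sda"):
    # /proc/diskstats: fields[5] is sectors read, fields[9] is sectors written (512-byte sectors).
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[5]) * 512, int(fields[9]) * 512
    raise ValueError("device not found: " + device)

read0, write0 = disk_bytes()
t0 = time.time()
cursorsql.executemany(add_twit, data_twit)   # the insert from the script above
cnx.commit()
elapsed = time.time() - t0
read1, write1 = disk_bytes()
print("read  %.1f MB  (%.2f MB/s)" % ((read1 - read0) / 1e6, (read1 - read0) / 1e6 / elapsed))
print("write %.1f MB  (%.2f MB/s)" % ((write1 - write0) / 1e6, (write1 - write0) / 1e6 / elapsed))

Note that writes InnoDB defers to the background (the minutes of writing observed after the insert) will not show up in this window.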

What I tried:

  • Splitting the insert queries (instead of 1*3200 rows I tried 16*200 rows and 3200*1 row); it didn't change anything, 1*3200 was the fastest.

  • OPTIMIZE TABLE (gained 15% speed).

  • Removing an unnecessary index.

My questions:

  • Why does the disk start reading when I submit the INSERT query, instead of writing? Is there a way to prevent that?

  • Would removing an index speed up the INSERT?

  • Do I need to remove the primary key (not the column, just the unique index on it)? It sounds like a bad idea, and (MySQL slows down after INSERT) suggests not to.

  • Any other suggestions?

  • Also, why does the disk keep writing at 6.00 MB/s for several minutes after a large insert?

First, some questions and assumptions:

  • About 60GB in the table?
  • About 5GB in INDEX(user_id)? (See Index_length in SHOW TABLE STATUS for 'twit'; a quick check is sketched after this list.)
  • Each INSERT adds 3200 new rows? If that is wrong, it is the main problem.
  • Are you calculating id_num instead of using AUTO_INCREMENT?
  • Is id_num monotonically increasing (or at least approximately)? If that is wrong, it is the main problem.
  • user_id is quite random.
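A quick way to check the first two numbers, assuming the same cursorsql cursor as in the question, is to read the figures that SHOW TABLE STATUS reports from information_schema:

# Approximate row count, data size, and secondary-index size for the twit table.
cursorsql.execute("""
    SELECT table_rows,
           ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
           ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
    FROM information_schema.tables
    WHERE table_schema = DATABASE() AND table_name = 'twit'
""")
rows, data_gb, index_gb = cursorsql.fetchone()
print("approx rows:", rows, " data:", data_gb, "GB", " secondary indexes:", index_gb, "GB")

index_length here covers all secondary indexes, which in this table is only INDEX(user_id).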

Analysis and conclusions:

  • The data is being "appended to"; that does not have an impact on the cache (buffer_pool, 8GB).
  • INDEX(user_id) is being randomly updated; this keeps most of that index in cache, or possibly it spills. If it has started spilling, performance is on the decline and will get worse and worse as cache misses increase.
  • "I/O continues after the write" -- this is normal. For the gory details, read up on "InnoDB change buffering". Summary: updates to INDEX(user_id) are delayed, but must occur eventually. (A sketch for peeking at the change buffer follows this list.)

Partial solutions:

  • More RAM.
  • Increase innodb_buffer_pool_size to 70% of RAM; be sure this does not lead to swapping.
  • Surely you don't have more than 4 billion users? Shrink user_id from BIGINT (8 bytes) to INT UNSIGNED (4 bytes); that shrinks the secondary index by about 25%. (A sketch of the ALTERs follows this list.)
  • Drop INDEX(user_id)? -- but surely you need it?
  • Do you use id_num somewhere else? If not, please explain its existence.
  • Change NULL to NOT NULL where appropriate. (It won't help speed; it's just cleanup.)
  • Use AUTO_INCREMENT instead of a hand-rolled id. (This may not help.)
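A sketch of the schema-side suggestions, again issued through the cursorsql cursor from the question. These ALTERs rewrite a 131-million-row table, so they are slow and should be tried on a copy first; the buffer pool size appears only as a comment because it is a startup option on MariaDB 10.0, and the 16GB figure is an assumption:

# In my.cnf (server option, requires a restart), something like:
#   innodb_buffer_pool_size = 11G     # ~70% of RAM, assuming a 16GB machine; adjust to actual RAM
# then make sure nothing swaps.

# Shrink user_id from BIGINT (8 bytes) to INT UNSIGNED (4 bytes); this also rebuilds INDEX(user_id).
cursorsql.execute("ALTER TABLE twit MODIFY user_id INT UNSIGNED NOT NULL")

# Example of tightening a nullable column -- only if the data really never contains NULLs there;
# the column keeps the table's utf8mb4 / utf8mb4_unicode_ci defaults.
cursorsql.execute("ALTER TABLE twit MODIFY location VARCHAR(70) NOT NULL")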

Benchmarking:

  • I would not use those 'raw' I/O metrics -- they are confused by the 'blocking' behaviour of InnoDB and the change buffer.
  • Wait for a "steady state". That is, avoid measuring against a small table, a cold machine, a burst of activity, etc. A graph of how long each 3200-row batch took will have ups and downs because of such things until it reaches a "steady state". But, based on the analysis of the secondary index above, it may decline to the point of 3200 rows taking 32 seconds (if using a spinning disk). (A logging sketch follows below.)
  • 3200 rows in 75 seconds does not make sense; I think we need to see the generated SQL.
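One way to collect that per-batch graph, assuming the cursorsql/cnx objects from the question and a hypothetical batch_source() generator standing in for the retrieval loop:

import csv
import time

# Log how long each 3200-row batch takes, so the trend toward a "steady state"
# (or toward a cache-miss-driven decline) becomes visible over many iterations.
with open("insert_timings.csv", "w", newline="") as f:
    log = csv.writer(f)
    log.writerow(["batch", "rows", "seconds"])
    for n, data_twit in enumerate(batch_source()):   # batch_source() is a placeholder
        t0 = time.time()
        cursorsql.executemany(add_twit, data_twit)
        cnx.commit()
        log.writerow([n, len(data_twit), round(time.time() - t0, 2)])
        f.flush()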
