python - Disk-Read slows-down INSERT in MySQL -


i trying optimize speed of large insert queries on mariadb (10.0.31) on innodb table.

here structure of table (131 million rows):

field__     type___     null    key     default        id_num_     bigint(45)  no      pri     null     content     varchar(250)yes             null     user_id     bigint(24)  no      mul     null     location    varchar(70) yes             null     date_creat  datetime    no      mul     null     retweet_ct  int(7)      no              null     isretweet   tinyint(1)  no              null     hasreetwet  tinyint(1)  no              null     original    bigint(45)  yes             null     url____     varchar(150)yes             null     favorite_c  int(7)      no              null     selected    int(11)     no              0    sentiment   int(11)     no              0    

here output of create table:

create table `twit` (  `id_num` bigint(45) not null,  `content` varchar(250) collate utf8mb4_unicode_ci default null,  `user_id` bigint(24) not null,  `location` varchar(70) collate utf8mb4_unicode_ci default null,  `date_create` datetime not null,  `retweet_count` int(7) not null,  `isretweet` tinyint(1) not null,  `hasreetweet` tinyint(1) not null,  `original` bigint(45) default null,  `url` varchar(150) collate utf8mb4_unicode_ci default null,  `favorite_count` int(7) not null,  `selected` int(11) not null default '0',  `sentiment` int(11) not null default '0',  primary key (`id_num`),  key `user_id` (`user_id`) ) engine=innodb default charset=utf8mb4 collate=utf8mb4_unicode_ci 

here structure of indexes:

table   non_unique  key_name    seq_in_index    column_name     collation   cardinality     sub_part packed     null    index_type  comment     index_comment    twit    0           primary     1               id_num                    124139401       null     null       btree        twit    1           user_id     1               user_id                   535083          null     null       btree        

here show engine innodb status:

buffer pool , memory ---------------------- total memory allocated 8942256128; in additional pool allocated 0 total memory allocated read views 184 internal hash tables (constant factor + variable factor)    adaptive hash index 141954688 (141606424 + 348264)    page hash           4426024 (buffer pool 0 only)    dictionary cache    35656039 (35403184 + 252855)    file system         845872 (812272 + 33600)    lock system         21251648 (21250568 + 1080)    recovery system     0 (0 + 0) dictionary memory allocated 252855 buffer pool size        524286 buffer pool size, bytes 8589901824 free buffers            448720 database pages          75545 old database pages      27926 modified db pages       0 percent of dirty pages(lru & free pages): 0.000 max dirty pages percent: 75.000 pending reads 0 pending writes: lru 0, flush list 0, single page 0 pages made young 0, not young 0 0.00 youngs/s, 0.00 non-youngs/s pages read 74639, created 906, written 39133 0.12 reads/s, 0.00 creates/s, 0.00 writes/s buffer pool hit rate 999 / 1000, young-making rate 0 / 1000 not 0 / 1000 pages read ahead 0.00/s, evicted without access 0.00/s, random read ahead 0.00/s lru len: 75545, unzip_lru len: 0 i/o sum[0]:cur[0], unzip sum[0]:cur[0] 

i use following python code download data 3rd party source, fill table it:

add_twit = (" insert table (id_num, content,....) values (%s, %s, ....)") testtime=0 t0 = time.time() data_twit = []  #### data retrieving  #### page in limit_handled(...):     status in page:         data_twit.append(processed_tweet) ####   ##### mysql insert  tt0 = time.time() cursorsql.executemany(add_twit, data_twit) testtime += time.time() - tt0 ####  cnx.commit() print('total_time ' + str(time.time()-t0)) print('sqlexecute_time ' + str(testtime)) 

what code do:

it gets twits 3rd party provider, 16 pages of them, 200 twits (status) per page, total of 3200 rows added table every iteratin (user). tried inserting 1 query per tweet (with cursorsql.execute(add_twit, data_twit), , 16 queries of 200 tweets in list too, fastest few seconds making 1 query of 3200 tweets using optimized cursorsql.executemany function.

for 3200 tweets, takes 10 seconds download them , 75 seconds write them database, seems lot considering 1 tweet (row) takes 0.2ko in table, hence 3200 640 ko. shouldn't take 75 seconds...

what happens when monitoring disk-usage iotop:

  • during data-retrieving part of code (after first iteration):
    • read = 0.00 b/s
    • write= 6.50 m/s

disk keeps writing several minutes @ 6mbs/s rate after large insert

  • during sql-insert part of code:

    • read = 1.5 m/s
    • write= 300 k/s

    looks disk reading (for index purposes guess?) makes writing rate fall down.

what tried:

  • try split insert queries (instead of 1*3200 rows tried 16*200 rows , 3200*1 row, didn't change anything, 1*3200 fastest)

  • optimize table (gained 15% speed)

  • remove unnecessary index

my questions:

  • why disk starts reading when submit insert query instead of writing? there way prevent that?
  • would remove index speed-up insert?

  • do need remove primary-key (not column, unique index on it), though sounds bad idea, , ( mysql slows down after insert ) suggests not to?

  • any other suggestions?
  • also, why disk keep writing @ 6.00 mb/s speed minutes after large insert?

  • about 60gb in table?
  • about 5gb in user_id index? (see index_length in show table status 'twit.)
  • each insert has 3200 new rows? if wrong, main problem.
  • you calculating id_num instead of using auto_increment?
  • id_num monotonically increasing? (or @ least approximately.) if wrong, main problem.
  • user_id quite random.

analysis , conclusions:

  • the data being "appended to"; not have impact on cache (buffer_pool, 8gb).
  • the user_id index being randomly updated; keeps of index in cache or, possibly, spills. if have started spilling, performance on decline, , worse , worse cache-misses increase.
  • "i/o continues after write" -- normal. gory details, "innodb change buffering". summary: updates of index(user_id) delayed, must occur eventually.

partial solutions:

  • more ram.
  • increase innodb_buffer_pool_size 70% of ram; sure not lead swapping.
  • surely don't have more 4 billion users? shrink user_id int unsigned (4 bytes) bigint (8 bytes). shrink secondary index 25%.
  • drop index(user_id) -- surely need it?
  • do use id_num somewhere else? if not, please explain existence.
  • change null not null appropriate. (won't speed, cleanup.)
  • use auto_increment instead of hand-rolled id. (may not help.)

benchmarking:

  • i not use 'raw' i/o metrics -- confused 'blocking' of innodb , change buffer.
  • wait "steady state". is, avoid small table, cold machine, burst, etc. graph of how long each 3200 took have ups , downs because of such things. reach "steady state". but, based on analysis of secondary index, may decline point of 3200 rows taking 32 seconds (if using spinning disk).
  • 3200 in 75 seconds not make sense. think need see generated sql.

Comments

Popular posts from this blog

Is there a better way to structure post methods in Class Based Views -

performance - Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? -

c# - Asp.net web api : redirect unauthorized requst to forbidden page -