python - Disk read slows down INSERT in MySQL
I am trying to optimize the speed of large INSERT queries with MariaDB (10.0.31) on an InnoDB table.
Here is the structure of the table (131 million rows):
Field           Type          Null  Key  Default
id_num          bigint(45)    NO    PRI  NULL
content         varchar(250)  YES        NULL
user_id         bigint(24)    NO    MUL  NULL
location        varchar(70)   YES        NULL
date_create     datetime      NO    MUL  NULL
retweet_count   int(7)        NO         NULL
isretweet       tinyint(1)    NO         NULL
hasreetweet     tinyint(1)    NO         NULL
original        bigint(45)    YES        NULL
url             varchar(150)  YES        NULL
favorite_count  int(7)        NO         NULL
selected        int(11)       NO         0
sentiment       int(11)       NO         0
Here is the output of SHOW CREATE TABLE:
CREATE TABLE `twit` (
  `id_num` bigint(45) NOT NULL,
  `content` varchar(250) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `user_id` bigint(24) NOT NULL,
  `location` varchar(70) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date_create` datetime NOT NULL,
  `retweet_count` int(7) NOT NULL,
  `isretweet` tinyint(1) NOT NULL,
  `hasreetweet` tinyint(1) NOT NULL,
  `original` bigint(45) DEFAULT NULL,
  `url` varchar(150) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `favorite_count` int(7) NOT NULL,
  `selected` int(11) NOT NULL DEFAULT '0',
  `sentiment` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id_num`),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Here is the structure of the indexes:
Table  Non_unique  Key_name  Seq_in_index  Column_name  Cardinality  Sub_part  Packed  Null  Index_type
twit   0           PRIMARY   1             id_num       124139401    NULL      NULL          BTREE
twit   1           user_id   1             user_id      535083       NULL      NULL          BTREE
Here is the output of SHOW ENGINE INNODB STATUS (buffer pool section):
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 8942256128; in additional pool allocated 0
Total memory allocated by read views 184
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 141954688 (141606424 + 348264)
    Page hash           4426024 (buffer pool 0 only)
    Dictionary cache    35656039 (35403184 + 252855)
    File system         845872 (812272 + 33600)
    Lock system         21251648 (21250568 + 1080)
    Recovery system     0 (0 + 0)
Dictionary memory allocated 252855
Buffer pool size        524286
Buffer pool size, bytes 8589901824
Free buffers            448720
Database pages          75545
Old database pages      27926
Modified db pages       0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 74639, created 906, written 39133
0.12 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 999 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 75545, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
I use the following Python code to download the data from a 3rd party source and fill the table with it:
add_twit = ("INSERT INTO table (id_num, content, ....) VALUES (%s, %s, ....)")
testtime = 0
t0 = time.time()

data_twit = []

#### Data retrieving ####
for page in limit_handled(...):
    for status in page:
        data_twit.append(processed_tweet)
####

#### MySQL insert ####
tt0 = time.time()
cursorsql.executemany(add_twit, data_twit)
testtime += time.time() - tt0
####

cnx.commit()
print('total_time ' + str(time.time() - t0))
print('sqlexecute_time ' + str(testtime))
What the code does:
It gets tweets from a 3rd party provider, 16 pages of them, 200 tweets (statuses) per page, for a total of 3200 rows added to the table on every iteration (one user). I tried inserting with 1 query per tweet (with cursorsql.execute(add_twit, data_twit)), and with 16 queries of 200 tweets each; the fastest by a few seconds is making 1 query of 3200 tweets using the optimized cursorsql.executemany function.
For 3200 tweets, it takes about 10 seconds to download them and 75 seconds to write them to the database, which seems like a lot considering that 1 tweet (row) takes about 0.2 KB in the table, hence 3200 tweets are about 640 KB. It shouldn't take 75 seconds...
Here is what happens when monitoring disk usage with iotop:
During the data-retrieving part of the code (after the first iteration):

- read = 0.00 B/s
- write = 6.50 MB/s

The disk keeps writing for several minutes at this 6 MB/s rate after a large insert.
During the SQL-insert part of the code:

- read = 1.5 MB/s
- write = 300 KB/s
It looks like the disk reading (for index purposes, I guess?) makes the writing rate fall.
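One way to check that guess is to compare the InnoDB buffer-pool read counters before and after one batch; a minimal sketch, assuming the same cursorsql and cnx objects as in the code above:

import time

def buffer_pool_counters(cursor):
    # Innodb_buffer_pool_reads = logical reads that missed the buffer pool
    # and had to go to disk; ..._read_requests = all logical reads.
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")
    return {name: int(value) for name, value in cursor.fetchall()}

before = buffer_pool_counters(cursorsql)
tt0 = time.time()
cursorsql.executemany(add_twit, data_twit)
cnx.commit()
print('insert took', round(time.time() - tt0, 1), 's')

after = buffer_pool_counters(cursorsql)
for name in sorted(before):
    print(name, after[name] - before[name])

If the Innodb_buffer_pool_reads delta is large for every batch, the slowdown really is disk reads caused by index maintenance.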
What I tried:
- Splitting the insert queries (instead of 1 query of 3200 rows I tried 16 queries of 200 rows, and 3200 queries of 1 row); it didn't change anything, 1 x 3200 was the fastest (a timing sketch follows this list).
- OPTIMIZE TABLE (gained 15% speed).
- Removing an unnecessary index.
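For the splitting experiment, a minimal sketch of the timing code, assuming the add_twit statement and the data_twit list from above; only the executemany() time is accumulated:

import time

def timed_insert(cursor, connection, statement, rows, chunk_size):
    # Insert `rows` in chunks of `chunk_size` and return the time spent
    # in executemany() only (the commit happens once at the end).
    sql_time = 0.0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        t0 = time.time()
        cursor.executemany(statement, chunk)
        sql_time += time.time() - t0
    connection.commit()
    return sql_time

# One call per fresh batch (re-running the same batch would hit duplicate
# id_num values); try chunk_size 3200, 200 and 1 on successive batches.
print(timed_insert(cursorsql, cnx, add_twit, data_twit, 200))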
My questions:
- Why does the disk start reading when I submit an INSERT query, instead of writing? Is there a way to prevent that?
- Would removing the index speed up the inserts?
- Do I need to remove the primary key (not the column, but the unique index on it)? It sounds like a bad idea, and (MySQL slows down after INSERT) suggests not to.
- Any other suggestions?
- Also, why does the disk keep writing at 6.00 MB/s for minutes after a large insert?
Answer:

- About 60GB in the table?
- About 5GB in the user_id index? (See Index_length in SHOW TABLE STATUS LIKE 'twit'.) A sketch for checking both figures follows this list.
- Each INSERT has 3200 new rows? If this is wrong, then that is the main problem.
- You are calculating id_num instead of using AUTO_INCREMENT?
- Is id_num monotonically increasing? (Or at least approximately.) If this is wrong, that is the main problem.
- user_id is quite random.
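A minimal sketch for checking the first two estimates from Python, assuming the connection objects from the question (Data_length and Index_length are reported in bytes):

cursorsql.execute("SHOW TABLE STATUS LIKE 'twit'")
columns = [col[0] for col in cursorsql.description]
status = dict(zip(columns, cursorsql.fetchone()))

# Data_length covers the clustered (PRIMARY KEY) data; Index_length covers
# the secondary index(es), here only user_id.
print('Data_length :', int(status['Data_length']) / 1024**3, 'GB')
print('Index_length:', int(status['Index_length']) / 1024**3, 'GB')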
Analysis and conclusions:
- The data is being "appended to"; that does not have much impact on the cache (buffer_pool, which is 8GB).
- The user_id index is being randomly updated; this keeps most of that index in the cache or, possibly, it spills. If it has just started spilling, performance is on the decline and will get worse and worse as cache misses increase.
- "I/O continues after the write" -- this is normal. For the gory details, look up "InnoDB change buffering". Summary: updates to INDEX(user_id) are delayed, but must occur eventually. (A sketch for watching this follows the list.)
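To watch that delayed work draining, you can poll the 'INSERT BUFFER AND ADAPTIVE HASH INDEX' section of SHOW ENGINE INNODB STATUS after the commit; a rough sketch, again assuming the cursorsql object from the question:

import time

def insert_buffer_section(cursor):
    # The last column of the single SHOW ENGINE INNODB STATUS row holds the
    # full status text; grab a slice around the change-buffer section.
    cursor.execute("SHOW ENGINE INNODB STATUS")
    status_text = cursor.fetchone()[-1]
    start = status_text.find('INSERT BUFFER AND ADAPTIVE HASH INDEX')
    return status_text[start:start + 500]    # rough slice, enough for the Ibuf lines

# Poll a few times after cnx.commit(); the pending merges should shrink
# while iotop still shows the ~6 MB/s background writes.
for _ in range(5):
    print(insert_buffer_section(cursorsql))
    time.sleep(60)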
Partial solutions:
- More RAM.
- Increase innodb_buffer_pool_size to 70% of RAM; be sure this does not lead to swapping.
- Surely you don't have more than 4 billion users? Shrink user_id from bigint (8 bytes) to int unsigned (4 bytes). This shrinks the secondary index by about 25%. (A sketch of the schema changes follows this list.)
- Or DROP INDEX(user_id) -- surely you do not need it?
- Do you use id_num somewhere else? If not, please explain its existence.
- Change NULL to NOT NULL where appropriate. (Won't speed things up, but it is a cleanup.)
- Use AUTO_INCREMENT instead of a hand-rolled id. (May not help.)
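A sketch of the two schema changes, assuming every user_id fits in 32 bits and that a long ALTER is acceptable (the table has 131 million rows, so it will be rewriting for a while; test on a copy first). Note that innodb_buffer_pool_size is not dynamic in MariaDB 10.0, so that setting goes in my.cnf and needs a restart.

# Shrinks the secondary index by roughly 25%; only valid if every user_id
# fits in an unsigned 32-bit integer.
cursorsql.execute("ALTER TABLE twit MODIFY user_id INT UNSIGNED NOT NULL")

# Only if the index really is unused by your queries:
# cursorsql.execute("ALTER TABLE twit DROP INDEX user_id")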
Benchmarking:
- I would not use 'raw' I/O metrics -- they are confused by the 'blocking' nature of InnoDB writes and by the change buffer.
- Wait for a "steady state". That is, avoid the small table, a cold machine, a burst, etc. A graph of how long each 3200-row batch took will have ups and downs because of such things, but it will eventually reach a "steady state". However, based on my analysis of the secondary index, this may decline to the point of 3200 rows taking 32 seconds (if using a spinning disk). (A sketch of this measurement follows the list.)
- 3200 rows in 75 seconds does not make sense. I think we need to see the generated SQL.
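What that measurement could look like in the question's Python code, as a rough sketch; batches_of_3200_rows is a hypothetical name standing in for the existing download loop:

import time

batch_seconds = []    # insert time of each 3200-row batch

# batches_of_3200_rows is a placeholder for the existing download loop.
for batch_number, data_twit in enumerate(batches_of_3200_rows):
    t0 = time.time()
    cursorsql.executemany(add_twit, data_twit)
    cnx.commit()
    batch_seconds.append(time.time() - t0)
    print(batch_number, round(batch_seconds[-1], 1), 's')

# A steady upward drift in batch_seconds (rather than random ups and downs)
# is the sign that INDEX(user_id) no longer fits in the buffer pool.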