Location>code7788 >text

10W data volume import and existing table data de-duplication

Popularity:41 ℃/2024-08-16 17:54:01

Using PostgreSQL

In doing a large amount of data (100,000 level) import, certain fields and database table data (ten million) duplication needs to be excluded, the table data query to use the program way to determine the de-emphasis of the efficiency is very low, and then consider the use of temporary tables.

  1. First insert the new data into the temporary table, which has the same structure as the original table.
  2. Query the non-duplicated data DataA using SQL.
  3. Insert DataA into the original table.

Because of the non-repeating data I have to do some other processing, so check out DataA, if you do not need to do special processing can be used directlyinsert into select Insert the data from step 2 into the original table in the same way as in step 2, and then empty the temporary table.

There are two ways to do step 2, one is to use thenot exists in a manner such as

SELECT
	* 
FROM
	tableTemp t1 
WHERE
	NOT EXISTS ( SELECT 1 FROM tableName WHERE columnA =  AND columnB =  )

The second way is to useleft join

SELECT
	* 
FROM
	tableTemp t1
	LEFT JOIN tableName t2 ON  =  AND  =  
WHERE
	 IS NULL

Tested (everyone's table structure and indexes are different, I'm just providing a reference here)

When the amount of data in the temporary table is small, thenot exists It took less time, with more data it took longer. When the data reaches 10w, it takes 25s.

When the amount of data in the temporary table is small, theleft join It took 30s, not much change with more data, and 40s when the data reached 10w.

Conclusion 1: When de-duplicating a single table, as long as the amount of imported data is not particularly large (20w or more), it is preferred to use not exists to do de-duplication.

But there is another case where you need to do de-duplication of two tables.

for example

SELECT
	* 
FROM
	tableTemp t1 
WHERE
	NOT EXISTS ( 
	SELECT 1 FROM tableNameA WHERE columnA =  AND columnB =  
	union all
	select 1 from tableNameB WHERE columnA =  AND columnB =  
	);

SELECT
	* 
FROM
	tableTemp t1
	LEFT JOIN tableNameA t2 ON  =  AND  =  
	LEFT JOIN tableNameB t3 ON  =  AND  =  
WHERE
	 IS NULL
	AND  IS NULL

In this case.

When there is little data in the temporary table, thenot exists It took less time, with more data it took longer. When the data reached 10w, it took 150s!!!!

When there is little data in the temporary table, theleft join The time taken is still 30s, and it gets longer as the amount of data gets bigger. When the data reaches 10w, the time taken is still 40s.

Both are comparable in terms of time spent at 3w data size

Conclusion 2: For double table de-duplication, use not exists when the imported data is below 3w, and use left join when it is above 3w.