I. Preface
Hello everyone, I'm summo. The first article showed you how to buy a server on AliCloud and set up JDK, Redis, and MySQL; in the second we built the back-end application and finished the first crawler (Douyin hot search). In this one I'll show you how to save the crawled data to the database and expose it through an interface, so the front-end pages we build later have a data source.
II. Table structure design
The structure of the hot search table is summarized below.
The table creation statement is as follows:
CREATE TABLE `t_sbmy_hot_search` (
  `id` bigint(20) unsigned zerofill NOT NULL AUTO_INCREMENT COMMENT 'Physical primary key',
  `hot_search_id` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT 'Hot search ID',
  `hot_search_excerpt` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci COMMENT 'Hot search excerpt',
  `hot_search_heat` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT 'Hot search heat',
  `hot_search_title` varchar(2048) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT 'Hot search title',
  `hot_search_url` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci COMMENT 'Hot search link',
  `hot_search_cover` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci COMMENT 'Hot search cover',
  `hot_search_author` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT 'Hot search author',
  `hot_search_author_avatar` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci COMMENT 'Hot search author avatar',
  `hot_search_resource` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT 'Hot search source',
  `hot_search_order` int DEFAULT NULL COMMENT 'Hot search ranking',
  `gmt_create` datetime DEFAULT NULL COMMENT 'Creation time',
  `gmt_modified` datetime DEFAULT NULL COMMENT 'Update time',
  `creator_id` bigint DEFAULT NULL COMMENT 'Creator ID',
  `modifier_id` bigint DEFAULT NULL COMMENT 'Modifier ID',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
The core fields are hot_search_id, hot_search_title, hot_search_url, and hot_search_heat; these four are guaranteed to have values. The other fields depend on whether the hot search interface supplies them, and are simply left empty when it doesn't. Executing the statement above creates the hot search record table.
III. Using a plugin to generate Java objects
There are many ways to generate the Java objects that map to the table structure. I recommend the MyBatis generator plugin: the operation is very simple, and once it's configured you can generate future objects the same way (ps: the development tool I use is IntelliJ IDEA Community Edition).
1. Create a generator directory in the resources directory of the summo-sbmy-dao module.
2. Create the generator properties file in the generator directory (e.g. generator.properties; the name only has to match the resource path referenced in the XML of the next step). Its content is as follows:
# JDBC URL
jdbc.url=jdbc:mysql://xxx:3306/summo-sbmy?characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai&rewriteBatchedStatements=true
# Username
jdbc.username=root
# Password
jdbc.password=xxx
# Table name
tableName=t_sbmy_hot_search
# Object name
domainObjectName=SbmyHotSearchDO
# Mapper name
mapperName=SbmyHotSearchMapper
This is the configuration file: write the table you want a Java object generated for into it and set the object name, and that's it. (The property keys above are illustrative; they just need to match the ${...} placeholders used in the generator XML below.)
3. Create the generator configuration XML in the generator directory (e.g. generatorConfig.xml). Its content is as follows:
<!DOCTYPE generatorConfiguration
        PUBLIC "-//mybatis.org//DTD MyBatis Generator Configuration 1.0//EN"
        "http://mybatis.org/dtd/mybatis-generator-config_1_0.dtd">
<generatorConfiguration>
    <properties resource="generator/generator.properties"/>
    <!-- Full path to the MySQL driver jar -->
    <classPathEntry location="/xxx/mysql-connector-java-8.0.xx.jar"/>
    <context targetRuntime="MyBatis3Simple" defaultModelType="flat">
        <property name="beginningDelimiter" value="`"/>
        <property name="endingDelimiter" value="`"/>
        <plugin type="tk.mybatis.mapper.generator.MapperPlugin">
            <!-- The generic base mapper the generated mappers extend -->
            <property name="mappers" value="tk.mybatis.mapper.common.Mapper"/>
            <property name="caseSensitive" value="true"/>
            <property name="lombok" value="Getter,Setter,Builder,NoArgsConstructor,AllArgsConstructor"/>
        </plugin>
        <!-- Database connection -->
        <jdbcConnection driverClass="com.mysql.cj.jdbc.Driver"
                        connectionURL="${jdbc.url}"
                        userId="${jdbc.username}" password="${jdbc.password}"/>
        <!-- Entity output path -->
        <javaModelGenerator targetPackage="xxx"
                            targetProject="src/main/java"/>
        <!-- XML output path -->
        <sqlMapGenerator targetPackage="mybatis/mapper"
                         targetProject="src/main/resources/">
            <property name="enableSubPackages" value="true"/>
        </sqlMapGenerator>
        <!-- Mapper output path -->
        <javaClientGenerator targetPackage="xxx"
                             targetProject="src/main/java"
                             type="XMLMAPPER">
        </javaClientGenerator>
        <!-- Table with auto-increment ID -->
        <table tableName="${tableName}" domainObjectName="${domainObjectName}"
               mapperName="${mapperName}">
            <generatedKey column="id" sqlStatement="Mysql" identity="true"/>
        </table>
    </context>
</generatorConfiguration>
These are the generation rules: they set where the entity, dao, mapper, and other files are produced. (Values that were stripped here, such as the plugin type and the targetPackage entries shown as xxx, follow the common tk.mybatis + Lombok setup; check the repository for the exact configuration and point the packages at your own module.) After these two files are configured, refresh the Maven project; provided the mybatis-generator-maven-plugin is declared in the module's pom (see the repository), the plugin will be recognized automatically.
4. Double-click mybatis-generator:generate to generate the objects
Find the plugin in the Maven panel and double-click mybatis-generator:generate; the corresponding DO, Mapper, and XML are generated. Other classes such as the controller, service, and repository cannot be generated, so you need to create those yourself. The final directory structure is as follows:
The repository code here was added by me; the plugin can't generate it. I also added an AbstractBaseDO, an abstract parent class that all DOs inherit from, and a MetaObjectHandlerConfig, which hooks into MyBatis-Plus's insert/update aspect to auto-fill the audit fields; some of the code is lightly fine-tuned as well. I'll post the code below; for details see my code repository. (Package declarations and project-internal imports are abbreviated as xxx in the listings that follow; see the repository for the real layout.)
package xxx;

import javax.persistence.Column;

import com.baomidou.mybatisplus.annotation.IdType;
import com.baomidou.mybatisplus.annotation.TableId;
import com.baomidou.mybatisplus.annotation.TableName;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
import lombok.ToString;

@Getter
@Setter
@TableName("t_sbmy_hot_search")
@NoArgsConstructor
@AllArgsConstructor
@Builder
@ToString
public class SbmyHotSearchDO extends AbstractBaseDO<SbmyHotSearchDO> {

    /**
     * Physical primary key
     */
    @TableId(type = IdType.AUTO)
    private Long id;

    /**
     * Hot search title
     */
    @Column(name = "hot_search_title")
    private String hotSearchTitle;

    /**
     * Hot search author
     */
    @Column(name = "hot_search_author")
    private String hotSearchAuthor;

    /**
     * Hot search source
     */
    @Column(name = "hot_search_resource")
    private String hotSearchResource;

    /**
     * Hot search ranking
     */
    @Column(name = "hot_search_order")
    private Integer hotSearchOrder;

    /**
     * Hot search ID
     */
    @Column(name = "hot_search_id")
    private String hotSearchId;

    /**
     * Hot search heat
     */
    @Column(name = "hot_search_heat")
    private String hotSearchHeat;

    /**
     * Hot search link
     */
    @Column(name = "hot_search_url")
    private String hotSearchUrl;

    /**
     * Hot search cover
     */
    @Column(name = "hot_search_cover")
    private String hotSearchCover;

    /**
     * Hot search author avatar
     */
    @Column(name = "hot_search_author_avatar")
    private String hotSearchAuthorAvatar;

    /**
     * Hot search excerpt
     */
    @Column(name = "hot_search_excerpt")
    private String hotSearchExcerpt;
}
package xxx;

import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface SbmyHotSearchMapper extends BaseMapper<SbmyHotSearchDO> {
}
package xxx;

import com.baomidou.mybatisplus.extension.service.IService;

public interface SbmyHotSearchRepository extends IService<SbmyHotSearchDO> {
}
package xxx;

import com.baomidou.mybatisplus.extension.service.impl.ServiceImpl;
import org.springframework.stereotype.Repository;

@Repository
public class SbmyHotSearchRepositoryImpl extends ServiceImpl<SbmyHotSearchMapper, SbmyHotSearchDO>
    implements SbmyHotSearchRepository {
}
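Because SbmyHotSearchRepository extends MyBatis-Plus's IService, the implementation above inherits a full set of generic CRUD methods (save, saveBatch, list, and so on) with no hand-written SQL; the saveCache2DB method further down calls list(...) and saveBatch(...) straight from this interface. A minimal usage sketch — the surrounding class and method names are my own illustration, not from the repository:

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class HotSearchRepositoryDemo {

    @Autowired
    private SbmyHotSearchRepository sbmyHotSearchRepository;

    public void demo() {
        SbmyHotSearchDO record = SbmyHotSearchDO.builder()
            .hotSearchTitle("demo").hotSearchId("demo-id").build();
        // Both methods are inherited from IService: single insert and unconditional query
        sbmyHotSearchRepository.save(record);
        List<SbmyHotSearchDO> all = sbmyHotSearchRepository.list();
    }
}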
package xxx;

import java.io.Serializable;
import java.util.Date;

import com.baomidou.mybatisplus.annotation.FieldFill;
import com.baomidou.mybatisplus.annotation.TableField;
import com.baomidou.mybatisplus.extension.activerecord.Model;
import lombok.Getter;
import lombok.Setter;

@Getter
@Setter
public class AbstractBaseDO<T extends Model<T>> extends Model<T> implements Serializable {

    /**
     * Creation time
     */
    @TableField(fill = FieldFill.INSERT)
    private Date gmtCreate;

    /**
     * Modification time
     */
    @TableField(fill = FieldFill.INSERT_UPDATE)
    private Date gmtModified;

    /**
     * Creator ID
     */
    @TableField(fill = FieldFill.INSERT)
    private Long creatorId;

    /**
     * Modifier ID
     */
    @TableField(fill = FieldFill.INSERT_UPDATE)
    private Long modifierId;
}
package xxx;

import java.util.Calendar;
import java.util.Date;

import com.baomidou.mybatisplus.core.handlers.MetaObjectHandler;
import org.apache.ibatis.reflection.MetaObject;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetaObjectHandlerConfig implements MetaObjectHandler {

    @Override
    public void insertFill(MetaObject metaObject) {
        Date date = Calendar.getInstance().getTime();
        this.setFieldValByName("gmtCreate", date, metaObject);
        this.setFieldValByName("gmtModified", date, metaObject);
    }

    @Override
    public void updateFill(MetaObject metaObject) {
        Date date = Calendar.getInstance().getTime();
        this.setFieldValByName("gmtModified", date, metaObject);
    }
}
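The effect: any insert or update that goes through MyBatis-Plus fills these audit fields automatically, so none of the crawler code below ever sets gmtCreate or gmtModified by hand. A quick sketch of what happens (repository variable as in the previous sketch):

SbmyHotSearchDO record = SbmyHotSearchDO.builder().hotSearchTitle("demo").build();
sbmyHotSearchRepository.save(record);   // insertFill runs inside this call
record.getGmtCreate();                  // now non-null
record.getGmtModified();                // now non-null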
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="xxx.SbmyHotSearchMapper">
  <resultMap id="BaseResultMap" type="xxx.SbmyHotSearchDO">
    <!--
      WARNING - @mbg.generated
    -->
    <id column="id" jdbcType="BIGINT" property="id" />
    <result column="hot_search_title" jdbcType="VARCHAR" property="hotSearchTitle" />
    <result column="hot_search_author" jdbcType="VARCHAR" property="hotSearchAuthor" />
    <result column="hot_search_resource" jdbcType="VARCHAR" property="hotSearchResource" />
    <result column="hot_search_order" jdbcType="INTEGER" property="hotSearchOrder" />
    <result column="gmt_create" jdbcType="TIMESTAMP" property="gmtCreate" />
    <result column="gmt_modified" jdbcType="TIMESTAMP" property="gmtModified" />
    <result column="creator_id" jdbcType="BIGINT" property="creatorId" />
    <result column="modifier_id" jdbcType="BIGINT" property="modifierId" />
    <result column="hot_search_id" jdbcType="VARCHAR" property="hotSearchId" />
    <result column="hot_search_heat" jdbcType="VARCHAR" property="hotSearchHeat" />
    <result column="hot_search_url" jdbcType="LONGVARCHAR" property="hotSearchUrl" />
    <result column="hot_search_cover" jdbcType="LONGVARCHAR" property="hotSearchCover" />
    <result column="hot_search_author_avatar" jdbcType="LONGVARCHAR" property="hotSearchAuthorAvatar" />
    <result column="hot_search_excerpt" jdbcType="LONGVARCHAR" property="hotSearchExcerpt" />
  </resultMap>
</mapper>
IV. Hot search data storage
1. Unique ID generation
At the end of the last article we got the Douyin hot search data, and now the table structure is designed too, so the storage logic becomes much simpler. One thing to note: Douyin hot search entries don't come with a unique ID, so to avoid inserting duplicates we have to set one ourselves. The ID generation algorithm is as follows:
/**
 * Get a unique ID based on the article title
 *
 * @param title Article title
 * @return Unique ID
 */
public static String getHashId(String title) {
    long seed = title.hashCode();
    Random rnd = new Random(seed);
    return new UUID(rnd.nextLong(), rnd.nextLong()).toString();
}

public static void main(String[] args) {
    System.out.println(getHashId("When you have a fat cat it's skinny"));
    System.out.println(getHashId("When you have a fat cat it's skinny"));
    System.out.println(getHashId("When you have a fat cat it's skinny"));
    System.out.println(getHashId("When you have a fat cat it's skinny"));
}
Run this logic and look at the output: the same title always produces the same hashId, which is exactly what we need.
2. Data storage process
The saving flow is shown below. The core is to query the database by that unique ID: if the record already exists, skip it; if not, save it.
The DouyinHotSearchJob code is as follows:
package xxx;

import java.io.IOException;
import java.util.List;
import java.util.Random;
import java.util.UUID;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import lombok.extern.slf4j.Slf4j;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

/**
 * @author summo
 * @version 1.0.0
 * @description Douyin hot search Java crawler code
 * @date 2024-08-09
 */
@Component
@Slf4j
public class DouyinHotSearchJob {

    @Autowired
    private SbmyHotSearchService sbmyHotSearchService;

    /**
     * Scheduled crawler method, runs once per hour
     */
    @Scheduled(fixedRate = 1000 * 60 * 60)
    public void hotSearch() throws IOException {
        try {
            // Fetch the Douyin hot search data
            OkHttpClient client = new OkHttpClient().newBuilder().build();
            Request request = new Request.Builder().url(
                "https://www.iesdouyin.com/web/api/v2/hotsearch/billboard/word/").method("GET", null).build();
            Response response = client.newCall(request).execute();
            JSONObject jsonObject = JSON.parseObject(response.body().string());
            JSONArray array = jsonObject.getJSONArray("word_list");
            List<SbmyHotSearchDO> sbmyHotSearchDOList = Lists.newArrayList();
            for (int i = 0, len = array.size(); i < len; i++) {
                // Get the information of one Douyin hot search entry
                JSONObject object = (JSONObject)array.get(i);
                // Build the hot search record
                SbmyHotSearchDO sbmyHotSearchDO = SbmyHotSearchDO.builder()
                    .hotSearchResource("douyin") // source identifier; the repository uses an enum constant here
                    .build();
                // Set the article title
                sbmyHotSearchDO.setHotSearchTitle(object.getString("word"));
                // Set the third-party Douyin ID, prefixing the title with the source
                sbmyHotSearchDO.setHotSearchId(getHashId("douyin" + sbmyHotSearchDO.getHotSearchTitle()));
                // Set the article link
                sbmyHotSearchDO.setHotSearchUrl(
                    "https://www.douyin.com/search/" + sbmyHotSearchDO.getHotSearchTitle() + "?type=general");
                // Set the hot search heat
                sbmyHotSearchDO.setHotSearchHeat(object.getString("hot_value"));
                // Ranking order
                sbmyHotSearchDO.setHotSearchOrder(i + 1);
                sbmyHotSearchDOList.add(sbmyHotSearchDO);
            }
            // Persist the data
            sbmyHotSearchService.saveCache2DB(sbmyHotSearchDOList);
        } catch (IOException e) {
            log.error("Error fetching Douyin hot search data", e);
        }
    }

    /**
     * Gets a unique ID based on the article title
     *
     * @param title Article title
     * @return Unique ID
     */
    public static String getHashId(String title) {
        long seed = title.hashCode();
        Random rnd = new Random(seed);
        return new UUID(rnd.nextLong(), rnd.nextLong()).toString();
    }
}
Here I've fine-tuned the crawler code from last time, removing some unnecessary headers and cookies.
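One prerequisite the job class itself doesn't show: Spring only honors @Scheduled methods when scheduling is enabled on the application. If your bootstrap class doesn't already have it, add @EnableScheduling — a minimal sketch, with an illustrative class name:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@EnableScheduling
@SpringBootApplication
public class SummoSbmyApplication {
    public static void main(String[] args) {
        SpringApplication.run(SummoSbmyApplication.class, args);
    }
}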
The data storage logic is as follows
@Override
public Boolean saveCache2DB(List<SbmyHotSearchDO> sbmyHotSearchDOS) {
    if (CollectionUtils.isEmpty(sbmyHotSearchDOS)) {
        return Boolean.TRUE;
    }
    // Check which of the incoming records already exist
    List<String> searchIdList = sbmyHotSearchDOS.stream()
        .map(SbmyHotSearchDO::getHotSearchId).collect(Collectors.toList());
    List<SbmyHotSearchDO> sbmyHotSearchDOList = list(
        new QueryWrapper<SbmyHotSearchDO>().lambda().in(SbmyHotSearchDO::getHotSearchId, searchIdList));
    // Filter out the data that already exists
    if (CollectionUtils.isNotEmpty(sbmyHotSearchDOList)) {
        List<String> tempIdList = sbmyHotSearchDOList.stream()
            .map(SbmyHotSearchDO::getHotSearchId).collect(Collectors.toList());
        sbmyHotSearchDOS = sbmyHotSearchDOS.stream()
            .filter(sbmyHotSearchDO -> !tempIdList.contains(sbmyHotSearchDO.getHotSearchId()))
            .collect(Collectors.toList());
    }
    if (CollectionUtils.isEmpty(sbmyHotSearchDOS)) {
        return Boolean.TRUE;
    }
    log.info("Adding [{}] new hot search records", sbmyHotSearchDOS.size());
    // Batch insert
    return saveBatch(sbmyHotSearchDOS);
}
This code is just the CRUD we always talk about; there's honestly not much to say. See the repository for the full implementation.
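To close the loop from the preface — exposing the saved data through an interface for the front end — a read endpoint only needs to query by source and sort by ranking. A minimal sketch; the controller class, mapping path, and parameter name are my own illustration, not necessarily what the repository uses:

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SbmyHotSearchController {

    @Autowired
    private SbmyHotSearchRepository sbmyHotSearchRepository;

    /**
     * Returns one source's hot search list ordered by ranking.
     */
    @GetMapping("/hotSearch/list")
    public List<SbmyHotSearchDO> listBySource(@RequestParam String source) {
        // lambdaQuery() is inherited from MyBatis-Plus's IService
        return sbmyHotSearchRepository.lambdaQuery()
            .eq(SbmyHotSearchDO::getHotSearchResource, source)
            .orderByAsc(SbmyHotSearchDO::getHotSearchOrder)
            .list();
    }
}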
V. Summary
Because this is a full project, there's a lot of framework-level and engineering detail. If I posted all the code, the article would be unreadable, so in subsequent articles I'll only post the core code and logic. Some things, like the scaffolding design, can't be explained clearly in one go, so for now just read along and copy the code in; anything that's unclear I can cover in a separate article later, explaining why I did it this way and what the benefits are.
Besides the code, I'll also show you how to use some plugins, like the code generator above; there are plenty of things like this, and I'll introduce them gradually in later articles so you pick up some genuinely useful and fun tools. Starting with this article I'm also releasing my repository address, and subsequent code will keep being pushed there. Since many people can't access GitHub, I'll use the domestic Gitee; the address is as follows:
Extra: Baidu Hot Search Crawler
1. Sizing up the crawl
This is what the Baidu hot search looks like, and the address is: https://top.baidu.com/board?tab=realtime&sa=fyb_realtime_31065
As you can see, the data is complete: title, cover, heat, and summary are all there. But unlike the Douyin hot search, this address returns an HTML page rather than JSON data, so we need a package that can handle HTML tags: jsoup. The dependency is as follows:
<!-- jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
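If jsoup is new to you, the whole workflow is three steps: parse the HTML into a Document, pick elements with CSS selectors, and read their text or attributes. A tiny self-contained warm-up — the HTML string is made up purely for illustration:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupQuickDemo {
    public static void main(String[] args) {
        String html = "<div class='item'><a class='title' href='/post/1'>Hello</a></div>";
        // Parse the HTML string into a DOM tree
        Document doc = Jsoup.parse(html);
        // CSS selector lookup, just like the .select(...) calls in the job below
        Element link = doc.selectFirst(".item a.title");
        System.out.println(link.text());        // prints: Hello
        System.out.println(link.attr("href")); // prints: /post/1
    }
}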
2. Web page parsing code
This one isn't much use poking at with Postman, so we call it directly with Jsoup. No belaboring the logic — straight to the code, BaiduHotSearchJob:
package xxx;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

/**
 * @author summo
 * @version 1.0.0
 * @description Baidu hot search Java crawler code
 * @date 2024-08-19
 */
@Component
@Slf4j
public class BaiduHotSearchJob {

    @Autowired
    private SbmyHotSearchService sbmyHotSearchService;

    /**
     * Scheduled crawler method, runs once per hour
     */
    @Scheduled(fixedRate = 1000 * 60 * 60)
    public void hotSearch() throws IOException {
        try {
            // Fetch the Baidu hot search page
            String url = "https://top.baidu.com/board?tab=realtime&sa=fyb_realtime_31065";
            List<SbmyHotSearchDO> sbmyHotSearchDOList = new ArrayList<>();
            Document doc = Jsoup.connect(url).get();
            // Titles
            Elements titles = doc.select(".c-single-text-ellipsis");
            // Cover images
            Elements imgs = doc.select(".category-wrap_iQLoo .index_1Ew5p").next("img");
            // Excerpts
            Elements contents = doc.select(".hot-desc_1m_jR.large_nSuFU");
            // Links
            Elements urls = doc.select(".category-wrap_iQLoo .img-wrapper_29V76");
            // Hot search index
            Elements levels = doc.select(".hot-index_1Bl1a");
            for (int i = 0; i < titles.size(); i++) {
                SbmyHotSearchDO sbmyHotSearchDO = SbmyHotSearchDO.builder()
                    .hotSearchResource("baidu") // source identifier; the repository uses an enum constant here
                    .build();
                // Set the article title
                sbmyHotSearchDO.setHotSearchTitle(titles.get(i).text().trim());
                // Set the third-party Baidu ID, prefixing the title with the source
                sbmyHotSearchDO.setHotSearchId(getHashId("baidu" + sbmyHotSearchDO.getHotSearchTitle()));
                // Set the article cover
                sbmyHotSearchDO.setHotSearchCover(imgs.get(i).attr("src"));
                // Set the article excerpt
                sbmyHotSearchDO.setHotSearchExcerpt(contents.get(i).text().replaceAll("View More>", ""));
                // Set the article link
                sbmyHotSearchDO.setHotSearchUrl(urls.get(i).attr("href"));
                // Set the hot search heat
                sbmyHotSearchDO.setHotSearchHeat(levels.get(i).text().trim());
                // Ranking order
                sbmyHotSearchDO.setHotSearchOrder(i + 1);
                sbmyHotSearchDOList.add(sbmyHotSearchDO);
            }
            // Persist the data
            sbmyHotSearchService.saveCache2DB(sbmyHotSearchDOList);
        } catch (IOException e) {
            log.error("Error fetching Baidu hot search data", e);
        }
    }

    /**
     * Gets a unique ID based on the article title
     *
     * @param title Article title
     * @return Unique ID
     */
    public static String getHashId(String title) {
        long seed = title.hashCode();
        Random rnd = new Random(seed);
        return new UUID(rnd.nextLong(), rnd.nextLong()).toString();
    }
}
Crawling a web page's data really isn't difficult. The key is whether you can quickly find the tags the data lives in, then use selectors to grab those tags' attributes or content. Jsoup is very pleasant to use for DOM parsing, and it shows up in crawler code all the time.