DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection
DGraph is a large-scale financial dataset for graph anomaly detection (GAD).
It contains 3 million nodes, 4 million dynamic edges, and 1 million ground-truth nodes.
The authors are focusing on
Large field: graph anomaly detection (GAD)
Real-world scenarios where anomalies are prevalent and destructive. The authors give the example of wire transfer fraud: in 2020 wire transfer fraudsters caused up to $1.8 trillion in losses. However, only 12-15% of cases are reported, and only 29% of victims fully recover their funds. Therefore, studying GAD can help prevent and detect the presence of these fraudsters.
Middle field: anomalous node detection in GAD
A representative task in GAD
Difficulties with datasets in the current field
Key requirements for the GAD dataset: bridging the gap between academia and industry
Because anomalies are scarce in the real world, only a few public datasets provide both graph structure and ground-truth anomaly labels, so enriching the diversity of GAD datasets is fundamental to current GAD research.
Collecting datasets from a number of representative but currently uncovered areas could greatly speed up this process, so this paper focuses on financial fraud detection.
Small field: financial fraud detection
Some limitations of current GAD datasets:
- Small scale
- Ignoring the temporal dynamics of the graph
- Ignoring that, in real-world scenarios, not all nodes need to be classified/predicted
The authors designed DGraph to address these three problems.
So here is a brief introduction to DGraph:
- DGraph has 3 million nodes and 4 million edges, and 1 million of its nodes carry ground-truth labels
- In a realistic scenario not all nodes need to be classified/predicted, but removing those nodes loses rich information and compromises the connectivity of the network structure. DGraph therefore keeps them as background nodes and calls the others target nodes. DGraph retains more than 2 million background nodes (users who are not targets of detection because they have no borrowing behavior)
- A node in DGraph represents a Finvolution user, and an edge from one user to another indicates that the first user has set the second user as an emergency contact
- DGraph contains a wealth of dynamic information
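The split between target and background nodes can be illustrated with a minimal sketch. The label encoding (0, 1, 2), the toy graph, and the helper names `target_nodes` and `neighbors` are assumptions made for this example, not the dataset's actual format or API. The point is only that background nodes are skipped as prediction targets while still contributing to the graph structure.

```python
# Hypothetical label encoding for this sketch only.
NORMAL, FRAUD, BACKGROUND = 0, 1, 2

# Toy directed graph: (src, dst) means "src listed dst as an emergency contact".
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 0)]
labels = {0: NORMAL, 1: BACKGROUND, 2: FRAUD, 3: BACKGROUND, 4: NORMAL}


def target_nodes(labels):
    """Return the nodes that are actually classified and evaluated."""
    return sorted(n for n, y in labels.items() if y != BACKGROUND)


def neighbors(edges, node):
    """Return all neighbors of a node, ignoring edge direction."""
    outgoing = {d for s, d in edges if s == node}
    incoming = {s for s, d in edges if d == node}
    return sorted(outgoing | incoming)


targets = target_nodes(labels)
print(targets)              # background nodes 1 and 3 are not prediction targets
print(neighbors(edges, 2))  # but background node 3 is still a neighbor of node 2
```

Keeping the background nodes in `edges` is what preserves connectivity; only the set of nodes used for training and evaluation shrinks.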
DGraph
Raw data
To summarize, the raw data for a particular user consists of five components:
(1) User ID.
(2) Basic profile information such as age, gender, etc.
(3) Phone numbers; note that each account is matched to a specific phone number.
(4) Borrowing behavior, including the repayment due date and the actual repayment date.
(5) Emergency contacts, including the name, phone number and last update time for each contact.
Fraudsters
Financial fraudsters often provide false personal information. Some also have unusual social networks compared with the average user, and others behave abnormally in their operations on the platform.
Node features
The node features derived from the basic profile form a 17-dimensional vector, where each dimension corresponds to a different element of the profile (e.g., age and gender). To protect user privacy, the authors do not disclose the specific meaning of any dimension. Missing values are marked as "-1".
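Since missing values are encoded as -1, one simple way to work with them is to build an indicator mask and a per-node missing count. The sketch below assumes features are plain Python lists and uses 4 dimensions instead of 17 for brevity. The function names are hypothetical, and this is one possible preprocessing step rather than the authors' method.

```python
MISSING = -1  # the notes above state that missing values are marked as "-1"


def missing_mask(features):
    """Return a 0/1 matrix with 1 wherever a feature value is missing."""
    return [[1 if value == MISSING else 0 for value in row] for row in features]


def missing_count(features):
    """Return the number of missing feature values for each node."""
    return [sum(row) for row in missing_mask(features)]


# Toy feature matrix: 2 nodes, 4 dimensions (the real vectors have 17).
features = [
    [0.3, -1, 1.0, -1],
    [0.5, 0.2, 0.0, 0.7],
]

print(missing_mask(features))
print(missing_count(features))
```

The mask can be concatenated to the features as extra input, and the count can serve as a simple feature of its own.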
When building the graph, only emergency contacts who are themselves Finvolution users are kept, to protect user privacy.
Node labels
32.2% of the nodes in DGraph have a relevant borrowing history.
Definition: users who, at least once, fail to repay long after the due date and ignore the platform's repeated reminders are anomalies/fraudsters.
The remaining borrowers are normal users, while users with no borrowing behavior are labeled as "background nodes".
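The labeling rule can be sketched from the due date and actual repayment date in the raw data. The notes only say "long after the due date", so the 90-day threshold below is a hypothetical placeholder, as are the function name and the loan representation. The sketch also leaves out the "ignored repeated reminders" condition, which cannot be derived from dates alone.

```python
from datetime import date

# Hypothetical threshold; the source does not state the actual value.
OVERDUE_THRESHOLD_DAYS = 90


def label_user(loans, today, threshold_days=OVERDUE_THRESHOLD_DAYS):
    """Label a user from a list of (due_date, repaid_date_or_None) loans."""
    if not loans:
        # No borrowing behavior at all: not a detection target.
        return "background"
    for due, repaid in loans:
        # An unpaid loan is measured up to the observation date.
        end = repaid if repaid is not None else today
        if (end - due).days > threshold_days:
            return "fraudster"  # at least one long-overdue loan
    return "normal"


today = date(2021, 1, 1)
print(label_user([], today))
print(label_user([(date(2020, 1, 1), date(2020, 1, 5))], today))
print(label_user([(date(2020, 1, 1), None)], today))
```

A single long-overdue loan is enough to label a user as a fraudster here, which matches the "at least once" wording of the definition.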
Observations on abnormal and normal nodes
- Fraudsters and normal users have a similar average in-degree, but their average out-degree differs significantly: the average out-degree of normal users is 2.33 times that of fraudsters.
- Fraudsters have lower neighbor similarity than normal users on outgoing edges
- Fraudsters and normal users differ in the number of missing values
- For the same out-degree, the average time interval between outgoing edges is lower for fraudsters than for normal users, suggesting that fraudsters are more likely to fill in their emergency contact information within a shorter period of time
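The degree comparison in the first observation can be reproduced on any labeled edge list. The sketch below uses a small invented graph, so its numbers do not reflect DGraph's statistics (such as the 2.33 ratio). The function name `average_degrees` and the string labels are assumptions for illustration.

```python
from collections import Counter


def average_degrees(edges, labels):
    """Return {class: (average in-degree, average out-degree)}."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    stats = {}
    for cls in set(labels.values()):
        nodes = [n for n, y in labels.items() if y == cls]
        stats[cls] = (
            sum(in_deg[n] for n in nodes) / len(nodes),
            sum(out_deg[n] for n in nodes) / len(nodes),
        )
    return stats


# Toy directed graph: (src, dst) means "src listed dst as an emergency contact".
edges = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 1)]
labels = {0: "normal", 1: "normal", 2: "fraudster", 3: "fraudster"}

stats = average_degrees(edges, labels)
print(stats["normal"], stats["fraudster"])
```

Because `Counter` returns 0 for missing keys, nodes without incoming or outgoing edges are handled without special cases.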
The paper also studies the background nodes, which is not covered here.
Experiments on DGraph
The authors organize the experiments around the following three questions:
Q1: How do current GAD models perform on DGraph?
Q2: How to handle missing values in DGraph?
Q3: How important are the background nodes of DGraph?
See the paper for conclusions.
Basics
What does "heterogeneous" mean in heterogeneous GNNs?
In heterogeneous GNNs (heterogeneous graph neural networks), "heterogeneous" refers to the diversity of node or edge types in the graph. Unlike traditional homogeneous graphs, heterogeneous graphs contain multiple types of nodes and/or multiple types of edges. This diversity often allows for a better representation of complex relationships and information structures.
Specifically, heterogeneity in heterogeneous graphs is mainly reflected in:
Node heterogeneity: there are different types of nodes in a graph. For example, in a social network, nodes can represent "users", "posts" or "comments".
Edge heterogeneity: there are different types of edges in a graph that represent different relationships or interactions. For example, in the same social network, edges can represent "users following users", "users liking posts" or "users commenting on posts".
Heterogeneous GNNs aim to take advantage of this diversity by designing specific network structures or mechanisms to efficiently capture and utilize the complex relationships between different types of nodes and edges, thus improving the ability to model heterogeneous data.
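A heterogeneous graph can be represented by storing a type for every node and grouping edges by a (source type, relation, destination type) triple. The `HeteroGraph` class below is a hypothetical minimal container written for this example. It is not the API of any GNN library, and it covers only the data structure, not message passing.

```python
from collections import defaultdict


class HeteroGraph:
    """Minimal container for a graph with typed nodes and typed edges."""

    def __init__(self):
        self.node_type = {}             # node id -> node type
        self.edges = defaultdict(list)  # (src type, relation, dst type) -> [(src, dst)]

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, relation, dst):
        # The edge type is determined by the relation and both endpoint types.
        key = (self.node_type[src], relation, self.node_type[dst])
        self.edges[key].append((src, dst))

    def edge_types(self):
        return sorted(self.edges)


g = HeteroGraph()
g.add_node("u1", "user")
g.add_node("u2", "user")
g.add_node("p1", "post")
g.add_edge("u1", "follows", "u2")
g.add_edge("u1", "likes", "p1")

print(g.edge_types())
```

A heterogeneous GNN would typically learn separate parameters for each of these edge types, whereas a homogeneous graph has a single node type and a single edge type.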