panther_vector_similarity#

panther_vector_similarity(G, source, *, D=10, k=5, path_length=5, c=0.5, delta=0.1, eps=None, weight='weight', seed=None)[source]#

Returns the Panther vector similarity (Panther++) of nodes in G.

Computes similarity between nodes based on the “Panther++” algorithm [1], which extends the basic Panther algorithm by using feature vectors to better capture structural similarity.

While basic Panther similarity measures how often two nodes appear on the same paths, Panther vector similarity (Panther++) creates a D-dimensional feature vector for each node using its top similarity scores with other nodes, then computes similarity based on the Euclidean distance between these feature vectors. This approach better captures structural similarity and addresses the bias towards close neighbors present in the original Panther algorithm.

This approach is preferred when:

You need better structural similarity than basic path co-occurrence
You want to overcome the close-neighbor bias of standard Panther
You’re working with large graphs where k-d tree indexing would be beneficial
Graph edit distance-like similarity is more appropriate than path co-occurrence

Parameters:

GNetworkX graph: A NetworkX graph
sourcenode: Source node for which to find the top k similar other nodes
Dint: The number of similarity scores to use (in descending order) for each feature vector. Defaults to 10. Note that the original paper used D=50 [1], but KDTree is optimized for lower dimensions.
kint: The number of most similar nodes to return
path_lengthint: How long the randomly generated paths should be (T in [1])
cfloat: A universal constant that controls the number of random paths to generate. Higher values increase the number of sample paths and potentially improve accuracy at the cost of more computation. Defaults to 0.5 as recommended in [1].
deltafloat: The probability that S is not an epsilon-approximation to (R, phi)
epsfloat: The error bound for similarity approximation. This controls the accuracy of the sampled paths in representing the true similarity. Smaller values yield more accurate results but require more sample paths. If None, a value of sqrt(1/|E|) is used, which the authors found empirically effective.
weightstring or None, optional (default=”weight”): The name of an edge attribute that holds the numerical value used as a weight. If None then each edge has weight 1.
seedinteger, random_state, or None (default): Indicator of random number generation state. See Randomness.

Returns:

similaritydict: Dict of nodes to similarity scores (as floats). Note: the self-similarity (i.e., node) is not included in the dict.

Notes

Results may be nondeterministic when feature vectors have the same distances, as the KDTree’s internal tie-breaking behavior can vary between runs. Using the same seed parameter ensures reproducible results.

References

[1] (1,2,3,4)

Zhang, J., Tang, J., Ma, C., Tong, H., Jing, Y., & Li, J. Panther: Fast top-k similarity search on large networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Vol. 2015-August, pp. 1445–1454). Association for Computing Machinery. https://doi.org/10.1145/2783258.2783267.

Examples

>>> G = nx.star_graph(100)

The “hub” node is distinct from the “spoke” nodes

>>> from pprint import pprint
>>> pprint(nx.panther_vector_similarity(G, source=0, seed=42))
{35: 0.10402634656233918,
 61: 0.10434063328712018,
 65: 0.10401247833456054,
 85: 0.10506718868571752,
 88: 0.10402634656233918}

But “spoke” nodes are similar to one another

>>> result = nx.panther_vector_similarity(G, source=1, seed=42)
>>> len(result)
5
>>> all(similarity == 1.0 for similarity in result.values())
True