{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Actor-Movie relations from Wikidata\n", "\n", "This Jupyter notebook provides an example of using the Python package [gravis](https://pypi.org/project/gravis). The .ipynb file can be found [here](https://github.com/robert-haas/gravis/tree/master/examples).\n", "\n", "It shows how a **network of actors and movies** can be visualized as a bipartite graph (a graph with two types of nodes, here actor nodes and movie nodes, which are visually distinguished by color). The data is fetched from **Wikidata** with **SPARQL** (a data query language) and describes the relations between actors and the movies they appeared in. Wikidata is incomplete, so many such relations are missing.\n", "\n", "## References\n", "\n", "- [Wikidata](https://www.wikidata.org)\n", "  - [Glossary](https://www.wikidata.org/wiki/Wikidata:Glossary)\n", "  - [SPARQL tutorial](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial)\n", "  - [Query examples 1](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries)\n", "  - [Query examples 2](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)\n", "  - [Example: Characters portrayed by most actors](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples#Characters_portrayed_by_most_actors)\n", "  - [List of properties](https://www.wikidata.org/wiki/Wikidata:List_of_properties)\n", "\n", "- Other\n", "  - [Tutorial: Where do Mayors Come From: Querying Wikidata with Python and SPARQL](https://janakiev.com/blog/wikidata-mayors/)\n", "  - [Your First SPARQL Query](https://docs.data.world/tutorials/sparql/Your_First_Sparql_Query.html)\n", "\n", "- Used here\n", "  - Property P18: [image](https://www.wikidata.org/wiki/Property:P18)\n", "  - Property P161: [cast member](https://www.wikidata.org/wiki/Property:P161)\n", "  - Property P453: [character role](https://www.wikidata.org/wiki/Property:P453)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "import random\n", "import string\n", "\n", "import gravis as gv\n", "import networkx as nx\n", "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation: Fetch data from Wikidata with a SPARQL query\n", "\n", "Goal: Fetch data about actors and movies from Wikidata in order to create a bipartite network of actor-movie relations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fetch_data(num_tries):\n", "    url = 'https://query.wikidata.org/sparql'\n", "    query = \"\"\"\n", "    SELECT ?filmLabel ?actorLabel ?characterLabel ?filmImage ?actorImage\n", "    WHERE {\n", "      ?film p:P161 [\n", "        ps:P161 ?actor;\n", "        pq:P453 ?character\n", "      ].\n", "      OPTIONAL{\n", "        ?film wdt:P18 ?filmImage.  # film / has image / filmImage\n", "        ?actor wdt:P18 ?actorImage.  # actor / has image / actorImage\n", "      }\n", "      SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }\n", "    }\n", "    LIMIT 100000\n", "    \"\"\"\n", "    for i in range(num_tries):\n", "        try:\n", "            print('Try number {}'.format(i+1))\n", "            # Use a fresh random User-Agent string for each try\n", "            random_string = ''.join(random.choice(string.ascii_letters) for _ in range(20))\n", "            headers = {'User-Agent': random_string}\n", "            params = {'format': 'json', 'query': query}\n", "            response = requests.get(url, headers=headers, params=params)\n", "            response.raise_for_status()  # treat HTTP errors as a failed try\n", "            data = response.json()\n", "            break\n", "        except Exception as exc:\n", "            print('Try failed:', exc)\n", "    else:\n", "        raise ValueError('Data fetching failed.')\n", "    return data\n", "\n", "\n", "data = fetch_data(num_tries=5)\n", "print('Number of items:', len(data['results']['bindings']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a bipartite graph of actors and movies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = nx.Graph()\n", "\n", "for item in data['results']['bindings']:\n", "    movie = item['filmLabel']['value']\n", "    actor = 'Actor: 
' + item['actorLabel']['value']\n", "    character = item['characterLabel']['value']\n", "\n", "    # Node type 1: Movie (red)\n", "    graph.add_node(movie)\n", "    node = graph.nodes[movie]\n", "    node['type'] = 'Movie'\n", "    node['color'] = 'red'\n", "    node['label_color'] = 'red'\n", "\n", "    # Node type 2: Actor (black)\n", "    graph.add_node(actor)\n", "    node = graph.nodes[actor]\n", "    node['type'] = 'Actor'\n", "    node['color'] = 'black'\n", "\n", "    # Edge between different node types\n", "    graph.add_edge(movie, actor)\n", "\n", "print('Number of nodes:', len(graph.nodes))\n", "print('Number of edges:', len(graph.edges))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def add_properties(graph):\n", "    # Scale the size of each node with its degree\n", "    for node, degree in graph.degree():\n", "        graph.nodes[node]['size'] = 10.0 + degree / 10.0\n", "\n", "add_properties(graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot filtered versions of the large graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter 1: Egocentric network (=neighborhood of a selected node)\n", "\n", "- GSU library: [Ego network](https://research.library.gsu.edu/c.php?g=916490&p=6612505)\n", "- Science direct topic: [Egocentric network](https://www.sciencedirect.com/topics/computer-science/egocentric-network)\n", "\n", "Focus on an actor (\"ego\") and show all nodes within two steps of it (`radius=2`): the movies the actor appeared in and the other actors in those movies (\"alters\"), together with all edges among these nodes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ego = 'Actor: Anthony Hopkins'\n", "\n", "ego_graph = nx.ego_graph(graph, ego, radius=2)\n", "ego_graph.nodes[ego]['shape'] = 'rectangle'\n", "ego_graph.nodes[ego]['color'] = 'green'\n", "ego_graph.nodes[ego]['label_color'] = 'green'\n", "\n", "print('Number of nodes:', len(ego_graph.nodes))\n", "print('Number of edges:', len(ego_graph.edges))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gv.d3(ego_graph, node_hover_neighborhood=True, zoom_factor=0.3, node_label_size_factor=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter 2: Only well-connected nodes with degree >= n\n", "\n", "Show only actors who appear in at least n movies, together with every movie that features at least one such actor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 10\n", "filtered_graph = graph.copy()\n", "\n", "# Step 1: Remove actors that appear in fewer than n movies\n", "to_remove = [node for node, degree in graph.degree()\n", "             if (degree < n and graph.nodes[node]['type'] == 'Actor')]\n", "filtered_graph.remove_nodes_from(to_remove)\n", "\n", "# Step 2: Remove movies that are left without any actor\n", "to_remove = [node for node, degree in filtered_graph.degree() if degree < 1]\n", "filtered_graph.remove_nodes_from(to_remove)\n", "\n", "print('Number of nodes:', len(filtered_graph.nodes))\n", "print('Number of edges:', len(filtered_graph.edges))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a precalculated layout: Fruchterman-Reingold\n", "layout = nx.spring_layout(filtered_graph, iterations=60, scale=5000)\n", "for node_id, (x, y) in layout.items():\n", "    node = filtered_graph.nodes[node_id]\n", "    node['x'] = x\n", "    node['y'] = y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot it with vis.js as a raster image on a canvas (less load on the browser than a d3.js SVG 
image)\n", "gv.vis(filtered_graph, node_hover_neighborhood=True, layout_algorithm_active=False, large_graph_threshold=10e10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }