While working on the optimal/hybrid artist, I have been testing further how a machine could play surrealist word-association games.
Instead of associating by syntax as shown in Synthetic word-associating games (I), I want the machine to associate more like a human – by context.
How can we teach this to the computer? We need to give the machine some kind of context, the most obvious being an online context. Given a searchWord, my applet does the following:
- Search for the searchWord in Google
- Open the resulting pages
- Find words at a close distance to the searchWord
- Count the occurrences of these words on all pages
- Build a probability distribution from these counts and draw a new searchWord from it
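The last three steps can be sketched roughly as follows. This is a minimal illustration and not the applet's actual code: the class name `ContextAssociator`, the method names, the `window` parameter (how many words to either side count as "close") and the simple letters-only tokenizer are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not the applet's actual code.
final class ContextAssociator {

    // Count every word found within `window` tokens of searchWord, over all pages.
    static Map<String, Integer> countNeighbours(List<String> pages, String searchWord, int window) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String target = searchWord.toLowerCase();
        for (String page : pages) {
            // Crude tokenizer (assumption): anything that is not a-z separates words.
            String[] tokens = page.toLowerCase().split("[^a-z]+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) continue;
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    // Skip empty tokens and the searchWord itself.
                    if (tokens[j].isEmpty() || tokens[j].equals(target)) continue;
                    counts.merge(tokens[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Draw one word with probability proportional to its count; null if there are none.
    static String pickWeighted(Map<String, Integer> counts, Random random) {
        int total = 0;
        for (int c : counts.values()) total += c;
        if (total == 0) return null;
        int r = random.nextInt(total);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) return e.getKey();
        }
        throw new IllegalStateException("counts changed during sampling");
    }
}
```

Calling `pickWeighted(countNeighbours(pages, searchWord, 5), new Random())` would then yield the next searchWord, with frequent neighbours more likely to be chosen than rare ones. The window size of 5 is an arbitrary choice for the example.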
Opening a Google page in Java is not as straightforward as you would expect. Google checks the User-Agent header, sees that Java is an unknown browser, and will not return any results. You need to fool Google into believing that your Java virtual machine is a known browser:
// imports needed by this snippet
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Fake the user agent; this is the one I am currently using.
String MyUserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17";
// Open the connection and tell the URLConnection which user agent to report
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", MyUserAgent);
// Now continue reading the page line by line
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String str; // a variable cannot be declared inside the loop condition
while ((str = in.readLine()) != null) {
    // ...
}
in.close();
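The snippet above does not show how urlString is built. A minimal sketch, assuming the standard Google search endpoint and a hypothetical helper class `SearchUrl` (neither is taken from the applet), could be:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not the applet's actual code.
final class SearchUrl {
    // Assumed search endpoint.
    static final String BASE = "https://www.google.com/search?q=";

    // Percent-encode the searchWord so spaces and symbols survive in the query string.
    static String forWord(String searchWord) {
        try {
            return BASE + URLEncoder.encode(searchWord, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Encoding matters as soon as a searchWord contains a space or a character such as `&`, which would otherwise be read as part of the URL syntax.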
After the Google result pages have been loaded, a lot of markup stripping needs to be done to find the meaningful text. Make sure to strip scripts as well, since their contents can look like natural language.
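As a rough illustration of this stripping step, here is a regex-based sketch. The class `HtmlStripper` and its patterns are assumptions for the example, not the applet's code, and regular expressions are a crude tool for HTML; a real HTML parser would be more robust on malformed pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the applet's actual code.
final class HtmlStripper {
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern ENTITY = Pattern.compile("&[#\\w]+;");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        // Remove script and style elements first, including their contents.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Then drop the remaining tags but keep the text between them.
        text = TAG.matcher(text).replaceAll(" ");
        // Entities such as &nbsp; are replaced by a space rather than decoded.
        text = ENTITY.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

The order matters: scripts and styles have to go before the generic tag pattern runs, otherwise only their tags are removed and the code between them stays behind as "text".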
After this I also check the candidate words against an English dictionary.
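A dictionary check of this kind can be as simple as a set lookup. The sketch below assumes a word list with one word per line; the class `DictionaryFilter` and its method names are made up for the example and are not the applet's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the applet's actual code.
final class DictionaryFilter {
    private final Set<String> words = new HashSet<>();

    // dictionaryLines: one word per line (assumed format).
    DictionaryFilter(Iterable<String> dictionaryLines) {
        for (String line : dictionaryLines) {
            String w = line.trim().toLowerCase();
            if (!w.isEmpty()) words.add(w);
        }
    }

    boolean isKnown(String candidate) {
        return words.contains(candidate.toLowerCase());
    }

    // Keep only the candidates that appear in the dictionary, preserving their order.
    List<String> keepKnown(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (isKnown(c)) kept.add(c);
        }
        return kept;
    }
}
```

The lines could come from, for example, `Files.readAllLines(Paths.get("words.txt"))`, where `words.txt` stands for whatever word list is available. Filtering this way also removes leftover markup fragments and identifiers that slipped through the stripping step.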
While writing this article, I came across another word-association application in which you play with the computer, each taking turns to associate. Try it.
Here are a few resulting word-association lists. The numbers in brackets are the number of milliseconds taken to find the next word. As opposed to Synthetic word-associating games (I), where words were usually found in under 1 second, here each word typically takes 10 to 50 seconds, and occasionally over a minute.
MURDER
degree (15218)
academic (38697)
main (25216)
static (14400)
class (26739)
upper (37278)
social (20683)
term (23356)
easter (18544)
date (24205)
day (22997)
seven (25399)
sound (51433)
speed (23601)
download (28482)
bottom (11050)
margin (22834)
left (34867)
politics (17866)
history (21236)
CHRISTIAN
katy (21786)
news (35267)
fox (26500)
community (28833)
more (24806)
read (18476)
reading (13701)
festival (65650)
food (19333)
network (26216)
computer (30773)
personal (22295)
company (14017)
limited (49951)
darjeeling (19220)
district (33827)
county (31305)
ARTIST
watercolor (22248)
painting (36036)
art (26014)
decorative (32349)
history (38820)
social (17902)
anthropology (21896)
cultural (20971)
american (55345)
import (10570)
trade (21797)
free (44367)
software (22549)
system (55482)
use (20733)
fair (17950)
source (22634)
counter (25273)
ring (17710)
diamond (62717)
COMPUTER
personal (21520)
company (14545)
limited (50118)
himself (15215)
dated (31271)
date (26903)
day (22779)
earth (25247)
surface (19450)
follow (32749)
nose (13835)
see (19066)
dictionary (13271)
cambridge (11583)
university (20680)
state (20455)
nation (31718)
content (36159)
before (16207)
present (30025)