Neo4j and Java: fast random sample of an Iterable<Relationship>
I've coded a traversal in Java that returns an Iterable. In the worst case the Iterable has a size of 850784 relationships.
Objective: I want to sample (without replacement) 20 relationships, and I want it to be fast.
Solution 1: performing a toList() or casting to some sort of Collection takes too much time (> 1 minute). I know I could then have taken advantage of the shuffle() function, etc., but this is unacceptable.
Solution 2: thus, in order to work directly on the Iterable, I've used the Guava collect library. I have included the time in milliseconds (calculated with System.nanoTime(), dividing by 1000000) for each of the 3 steps below. I need to take the size of the Iterable for the random number generator, and that is the real bottleneck.
    /* traversal: 5 ms */
    Iterable<Relationship> simrels = traversal1.traverse(user).relationships();

    /* iterable size: 74669 ms */
    int simrelssize = com.google.common.collect.Iterables.size(simrels);

    /* random sample of 20: 28321 ms */
    long seed = System.nanoTime();
    int[] idxs = new int[20];
    Random randomgenerator = new XSRandom(seed);
    for (int i = 0; i < idxs.length; ++i) {
        int randomint = randomgenerator.nextInt(simrelssize);
        idxs[i] = randomint;
    }
    Arrays.sort(idxs);
    List<Relationship> simrelslist2 = new ArrayList<Relationship>();
    for (int i = 0; i < idxs.length; ++i) {
        if (i > 0) {
            int pos = idxs[i] - idxs[i-1];
            simrelslist2.add(com.google.common.collect.Iterables.get(simrels, pos));
        } else {
            simrelslist2.add(com.google.common.collect.Iterables.get(simrels, idxs[i]));
        }
    }
How can I optimize the code to make it go faster?
Note: I have a Windows 8.1 PC, i5 2.30 GHz, 16 GB RAM, 1 TB HDD.
As per Michal's request, please find the file contents below:
neo4j-wrapper
    #********************************************************************
    # Property file references
    #********************************************************************
    wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
    wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
    wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties

    #********************************************************************
    # JVM Parameters
    #********************************************************************
    wrapper.java.additional=-XX:+UseConcMarkSweepGC
    wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
    wrapper.java.additional=-XX:-OmitStackTraceInFastThrow

    # Remote JMX monitoring, uncomment and adjust the following lines as needed.
    # Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
    # the shipped configuration contains only a read only role called 'monitor' with password 'neo4j'.
    # For more details, see: http://download.oracle.com/javase/7/docs/technotes/guides/management/agent.html
    # On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
    # and have permissions set to 0600.
    # For details on setting these file permissions on Windows see:
    # http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
    #wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access

    # Some systems cannot discover the host name automatically, and need this line configured:
    #wrapper.java.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME

    # Uncomment the following lines to enable garbage collection logging
    #wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
    #wrapper.java.additional=-XX:+PrintGCDetails
    #wrapper.java.additional=-XX:+PrintGCDateStamps
    #wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
    #wrapper.java.additional=-XX:+PrintPromotionFailure
    #wrapper.java.additional=-XX:+PrintTenuringDistribution

    # Java heap size: by default the Java heap size is dynamically
    # calculated based on available system resources.
    # Uncomment these lines to set a specific initial and maximum
    # heap size in MB.
    wrapper.java.initmemory=8192
    wrapper.java.maxmemory=10240

    #********************************************************************
    # Wrapper settings
    #********************************************************************
    # Path is relative to the bin dir
    wrapper.pidfile=../data/neo4j-server.pid

    #********************************************************************
    # Wrapper Windows NT/2000/XP Service Properties
    #********************************************************************
    # WARNING - Do not modify any of these properties when an application
    # using this configuration file has been installed as a service.
    # Please uninstall the service before modifying this section. The
    # service can then be reinstalled.

    # Name of the service
    wrapper.name=neo4j

    # User account to be used for Linux installs. Will default to the current
    # user if not set.
    wrapper.user=
neo4j.properties
    # Enable this to be able to upgrade a store from an older version.
    #allow_store_upgrade=true

    # The amount of memory to use for mapping the store files, either in bytes or
    # as a percentage of available memory. This will be clipped at the amount of
    # free memory observed when the database starts, and automatically be rounded
    # down to the nearest whole page. For example, if "500MB" is configured, but
    # only 450MB of memory is free when the database starts, then the database will
    # map at most 450MB. If "50%" is configured, and the system has a capacity of
    # 4GB, then at most 2GB of memory will be mapped, unless the database observes
    # that less than 2GB of memory is free when it starts.
    #mapped_memory_total_size=50%

    # Enable this to specify a parser other than the default one.
    #cypher_parser_version=2.0

    # Keep logical logs, helps debugging but uses more disk space, enabled for
    # legacy reasons. To limit the space needed to store historical logs use values such
    # as: "7 days" or "100M size" instead of "true".
    #keep_logical_logs=7 days

    # Autoindexing

    # Enable auto-indexing for nodes, default is false.
    #node_auto_indexing=true

    # The node property keys to be auto-indexed, if enabled.
    #node_keys_indexable=name,age

    # Enable auto-indexing for relationships, default is false.
    #relationship_auto_indexing=true

    # The relationship property keys to be auto-indexed, if enabled.
    #relationship_keys_indexable=name,age

    # Enable shell server so that remote clients can connect via Neo4j shell.
    #remote_shell_enabled=true
    # The network interface IP the shell will listen on (use 0.0.0 for all interfaces).
    #remote_shell_host=127.0.0.1
    # The port the shell will listen on, default is 1337.
    #remote_shell_port=1337

    # The type of cache to use for nodes and relationships.
    #cache_type=hpc

    # Maximum size of the heap memory to dedicate to the cached nodes.
    #node_cache_size=

    # Maximum size of the heap memory to dedicate to the cached relationships.
    #relationship_cache_size=

    # Enable online backups to be taken from this database.
    online_backup_enabled=true

    # Port to listen to for incoming backup requests.
    online_backup_server=127.0.0.1:6362

    # Uncomment and specify these lines for running Neo4j in High Availability mode.
    # See the High Availability setup tutorial for more details on these settings
    # http://neo4j.com/docs/2.2.0-m02/ha-setup-tutorial.html

    # ha.server_id is the number of each instance in the HA cluster. It should be
    # an integer (e.g. 1), and should be unique for each cluster instance.
    #ha.server_id=

    # ha.initial_hosts is a comma-separated list (without spaces) of the host:port
    # where the ha.cluster_server of all instances will be listening. Typically
    # this will be the same for all cluster instances.
    #ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001

    # IP and port for this instance to listen on, for communicating cluster status
    # information with other instances (also see ha.initial_hosts). The IP
    # must be the configured IP address for one of the local interfaces.
    #ha.cluster_server=192.168.0.1:5001

    # IP and port for this instance to listen on, for communicating transaction
    # data with other instances (also see ha.initial_hosts). The IP
    # must be the configured IP address for one of the local interfaces.
    #ha.server=192.168.0.1:6001

    # The interval at which slaves will pull updates from the master. Comment out
    # the option to disable periodic pulling of updates. Unit is seconds.
    ha.pull_interval=10

    # Amount of slaves the master will try to push a transaction to upon commit
    # (default is 1). The master will optimistically continue and not fail the
    # transaction even if it fails to reach the push factor. Setting this to 0 will
    # increase write performance when writing through the master but could potentially
    # lead to branched data (or loss of transaction) if the master goes down.
    #ha.tx_push_factor=1

    # Strategy the master will use when pushing data to slaves (if the push factor
    # is greater than 0). There are two options available: "fixed" (default) or
    # "round_robin". Fixed will start by pushing to slaves ordered by server id
    # (highest first), improving performance since the slaves only have to cache up
    # one transaction at a time.
    #ha.tx_push_strategy=fixed

    # Policy for how to handle branched data.
    #branched_data_policy=keep_all

    # Clustering timeouts
    # Default timeout.
    #ha.default_timeout=5s

    # How often heartbeat messages should be sent. Defaults to ha.default_timeout.
    #ha.heartbeat_interval=5s

    # Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
    #heartbeat_timeout=11s
The reason Neo4j is returning an Iterable is that it executes the traversal whilst you're iterating. In order to sample, I'm afraid you have to "visit" every relationship. Yes, you can skip some, but you still have to iterate through all of them at the end of the day.
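To make the cost of a lazy Iterable concrete, here is a minimal sketch that counts how many elements get visited. `CountingIterable` is a hypothetical stand-in for the traversal result, not a Neo4j or Guava class, and its `size` and `get` helpers are assumptions that mimic what a generic helper has to do when the source is not a List: start a fresh iterator and walk from the beginning on every call.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical lazy source that counts every element it hands out. */
final class CountingIterable implements Iterable<Integer> {
    private final int length;
    private long visited = 0;

    CountingIterable(int length) { this.length = length; }

    long visited() { return visited; }

    @Override
    public Iterator<Integer> iterator() {
        // Each call starts a new pass from the beginning, like re-running a traversal.
        return new Iterator<Integer>() {
            private int cursor = 0;
            @Override public boolean hasNext() { return cursor < length; }
            @Override public Integer next() {
                if (cursor >= length) throw new NoSuchElementException();
                visited++;
                return cursor++;
            }
        };
    }

    /** Counts elements by walking the whole source once. */
    static int size(Iterable<?> source) {
        int n = 0;
        for (Object ignored : source) n++;
        return n;
    }

    /** Returns the element at index by walking from the start. */
    static <T> T get(Iterable<T> source, int index) {
        Iterator<T> it = source.iterator();
        for (int i = 0; i < index; i++) it.next();
        return it.next();
    }

    public static void main(String[] args) {
        CountingIterable rels = new CountingIterable(1000);
        size(rels);      // full pass: 1000 visits
        get(rels, 499);  // new pass: 500 visits
        get(rels, 999);  // new pass: 1000 visits
        System.out.println(rels.visited()); // prints 2500
    }
}
```

With 20 indexed lookups on top of one size computation, the same elements are walked many times over, which is why a single-pass approach is attractive here.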
We're using the "reservoir sampling" algorithm for this, implemented here. I'm not sure it's going to perform better though, for the reason described above. That said, you should be able to sample 1M relationships in less than 1 second with a warm cache. If it's taking longer than that, you may need to tweak your memory settings a bit.
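For illustration, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") over a plain Iterable. It is not the implementation referred to above; the class name `ReservoirSampler` and its `sample` method are hypothetical. It picks k items without replacement in a single pass and never needs the size up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: single-pass uniform sample without replacement. */
final class ReservoirSampler {
    private ReservoirSampler() {}

    static <T> List<T> sample(Iterable<T> source, int k, Random random) {
        if (k < 0) throw new IllegalArgumentException("k must be non-negative");
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // assumes fewer than Integer.MAX_VALUE elements
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Keep the new item with probability k / seen,
                // replacing a uniformly chosen slot.
                int j = random.nextInt(seen);
                if (j < k) reservoir.set(j, item);
            }
        }
        // Holds fewer than k items if the source was shorter than k.
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 850784; i++) ids.add(i);
        List<Integer> picked = sample(ids, 20, new Random());
        System.out.println(picked.size()); // prints 20
    }
}
```

Applied to the question's code, a call along the lines of `ReservoirSampler.sample(simrels, 20, new Random())` would replace both the size computation and the indexed lookups with one pass, though every relationship is still visited once.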