python - How can we define filters/properties for an Elasticsearch index using Spark with Scala?
I used the following function in Python to initialize an index in Elasticsearch.
    # constants is the asker's own module (es_client, index_name, type_name, analyzer_name)
    def init_index():
        # Create the index with custom filters and a custom analyzer
        constants.es_client.indices.create(
            index = constants.index_name,
            body = {
                "settings": {
                    "index": {
                        "type": "default"
                    },
                    "number_of_shards": 1,
                    "number_of_replicas": 1,
                    "analysis": {
                        "filter": {
                            "ap_stop": {
                                "type": "stop",
                                "stopwords_path": "stoplist.txt"
                            },
                            "shingle_filter": {
                                "type": "shingle",
                                "min_shingle_size": 2,
                                "max_shingle_size": 5,
                                "output_unigrams": True
                            }
                        },
                        "analyzer": {
                            constants.analyzer_name: {
                                "type": "custom",
                                "tokenizer": "standard",
                                "filter": ["standard", "ap_stop", "lowercase", "shingle_filter", "snowball"]
                            }
                        }
                    }
                }
            }
        )

        # Map the "text" field to the custom analyzer
        new_mapping = {
            constants.type_name: {
                "properties": {
                    "text": {
                        "type": "string",
                        "store": True,
                        "index": "analyzed",
                        "term_vector": "with_positions_offsets_payloads",
                        "search_analyzer": constants.analyzer_name,
                        "index_analyzer": constants.analyzer_name
                    }
                }
            }
        }

        constants.es_client.indices.put_mapping(
            index = constants.index_name,
            doc_type = constants.type_name,
            body = new_mapping
        )

Using this function I was able to create an index with user-defined specs.
I have recently started working with Scala and Spark. To integrate with Elasticsearch I can either use the Spark API, i.e. org.elasticsearch.spark, or the Hadoop one, org.elasticsearch.hadoop. Most of the examples I see relate to the Hadoop approach, but I don't wish to use Hadoop here. I went through the Spark-Elasticsearch documentation and was able to at least index documents without including Hadoop, but I noticed that everything is created with defaults. I can't even specify the _id there; it generates the _id on its own.
In Scala I use the following code for indexing (not the complete code):

    import scala.collection.mutable
    import org.elasticsearch.spark._ // adds saveToEs to RDDs

    val document = mutable.Map[String, String]()
    document("id") = docId
    document("text") = textChunk.mkString(" ") // textChunk is a list of strings
    sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document")

This created the index this way:
    {
      "es_park_ap": {
        "mappings": {
          "document": {
            "properties": {
              "id": { "type": "string" },
              "text": { "type": "string" }
            }
          }
        },
        "settings": {
          "index": {
            "creation_date": "1433006647684",
            "uuid": "qnxctamgqgkx7rp-h8fvig",
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "version": { "created": "1040299" }
          }
        }
      }
    }

So if I pass a document to it, the following document is created:
    {
      "_index": "es_park_ap",
      "_type": "document",
      "_id": "au2l2ixcaorl_gagnja5",
      "_score": 1,
      "_source": {
        "text": "some large text",
        "id": "12345"
      }
    }

Just like in Python, how can I use Spark and Scala to create an index with user-defined specifications?
I think we should divide your question into several smaller issues.
If you want to create an index with a specific mapping / settings, you should use the Elasticsearch Java API directly (you can call it from Scala code, of course). You can use the following sources for examples of index creation:
Creating index and adding mapping in Elasticsearch with Java API gives missing analyzer errors
Define custom Elasticsearch analyzer using Java API
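As a rough sketch of this first point, the index could be created from Scala before Spark writes to it. Note that this does not use the Java client from the linked questions: it calls the plain REST create-index endpoint (PUT on the index name) using only JDK classes, which I am assuming is acceptable here. The names IndexSetup, createIndexBody and createIndex and the localhost URL are made up for illustration, and the JSON follows the 1.x syntax from the question, leaving out the "index.type" setting.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of elasticsearch-spark or the Java client.
object IndexSetup {

  // Builds the create-index body: shard/replica counts, the custom filters and
  // analyzer, and the mapping for the "text" field, as in the Python function.
  def createIndexBody(analyzerName: String, typeName: String): String =
    s"""{
       |  "settings": {
       |    "number_of_shards": 1,
       |    "number_of_replicas": 1,
       |    "analysis": {
       |      "filter": {
       |        "ap_stop": { "type": "stop", "stopwords_path": "stoplist.txt" },
       |        "shingle_filter": {
       |          "type": "shingle",
       |          "min_shingle_size": 2,
       |          "max_shingle_size": 5,
       |          "output_unigrams": true
       |        }
       |      },
       |      "analyzer": {
       |        "$analyzerName": {
       |          "type": "custom",
       |          "tokenizer": "standard",
       |          "filter": ["standard", "ap_stop", "lowercase", "shingle_filter", "snowball"]
       |        }
       |      }
       |    }
       |  },
       |  "mappings": {
       |    "$typeName": {
       |      "properties": {
       |        "text": {
       |          "type": "string",
       |          "store": true,
       |          "index": "analyzed",
       |          "term_vector": "with_positions_offsets_payloads",
       |          "search_analyzer": "$analyzerName",
       |          "index_analyzer": "$analyzerName"
       |        }
       |      }
       |    }
       |  }
       |}""".stripMargin

  // Sends PUT <baseUrl>/<indexName> with the JSON body and returns the HTTP status code.
  def createIndex(baseUrl: String, indexName: String, body: String): Int = {
    val conn = new URL(s"$baseUrl/$indexName").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val out = conn.getOutputStream
    try out.write(body.getBytes(StandardCharsets.UTF_8)) finally out.close()
    val status = conn.getResponseCode
    conn.disconnect()
    status
  }
}
```

Calling IndexSetup.createIndex("http://localhost:9200", "es_park_ap", IndexSetup.createIndexBody("my_analyzer", "document")) once before the saveToEs call should leave the connector writing into the already configured index rather than auto-creating one with default settings. Here "my_analyzer" is just a placeholder for your analyzer name.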
The Elasticsearch Hadoop / Spark plugin is used to transport data from HDFS to ES. ES maintenance, such as creating indices with custom settings, should be done separately.
The reason you are still seeing an automatically generated id is that you must tell the plugin which field holds your id, using the following syntax:

    EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "your_id_field"))

or, in your case:

    sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document", Map("es.mapping.id" -> "your_id_field"))

For the document in your question, that field would be "id". You can find more details about the syntax and its proper use here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Michael