python - Comparison between one element and all the others of a DataFrame column


I have a list of tuples that I turned into a DataFrame with thousands of rows, like this:

                      frag         mass  prot_position
    0       tfdehnapnsnsnk  1573.675712              2
    1        epganaigmvafk  1303.659458             29
    2                 gtik   417.258734              2
    3             spwpsmar   930.438172             44
    4                 lpak   427.279469             29
    5  nedsfvvweqiinslsalk  2191.116099             17
    ...

And I have the following rule:

    def are_dif(m1, m2, ppm=10):
        # Masses count as different when their relative difference
        # is at least `ppm` parts per million.
        if abs((m1 - m2) / m1) < ppm * 0.000001:
            v = False
        else:
            v = True
        return v

So, I only want the "frag"s whose mass differs from the mass of every other fragment. How can I achieve this "selection"?
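To make the goal concrete, here is a minimal brute-force sketch that applies the rule to every pair of masses using NumPy broadcasting. The helper name `unique_mass_mask` and the sample values are hypothetical, and the pairwise matrix needs memory proportional to the square of the row count, so this is only reasonable for a few thousand rows.

```python
import numpy as np
import pandas as pd


def unique_mass_mask(masses, ppm=10):
    """Return True where a mass differs from every other mass by at least `ppm` ppm."""
    m = np.asarray(masses, dtype=float)
    # rel[i, j] = |m[i] - m[j]| / m[i], the same quantity are_dif(m[i], m[j]) tests.
    rel = np.abs(m[:, None] - m[None, :]) / m[:, None]
    close = rel < ppm * 1e-6
    # A mass is always "close" to itself, so ignore the diagonal.
    np.fill_diagonal(close, False)
    return ~close.any(axis=1)


# Hypothetical sample data: the first two masses are within 10 ppm of each other.
df = pd.DataFrame({
    "frag": ["tfdehnapnsnsnk", "epganaigmvafk", "gtik", "lpak"],
    "mass": [1573.675712, 1573.675700, 417.258734, 427.279469],
})
selected = df[unique_mass_mask(df["mass"])]
print(selected)
```

With this sample, only the rows for "gtik" and "lpak" remain, because the first two masses are too close to each other.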

Then, I have a list named "pinfo" that contains dictionaries like this:

    d = {'id': id, 'seq': seq_code, "1hw_fit": hits_fit}
    # one dictionary for each protein
    # each dictionary is at the position of the protein it describes

So, I want to add 1 to the "hits_fit" value in the dictionary of the respective protein.

If I'm understanding correctly (and I'm not sure I am), you can accomplish quite a bit with sorting. First though, let me adjust the data so that it has a mix of close and far values for mass:

       Unnamed: 0                 frag         mass  prot_position
    0           0       tfdehnapnsnsnk  1573.675712              2
    1           1        epganaigmvafk  1573.675700             29
    2           2                 gtik   417.258734              2
    3           3             spwpsmar   417.258700             44
    4           4                 lpak   427.279469             29
    5           5  nedsfvvweqiinslsalk  2191.116099             17

Then I think you can do the following to select the "good" ones. First, create 'pdiff' (the relative difference, expressed as a fraction) to see how close each mass is to its nearest neighbors:

    ppm = .00001  # 10 parts per million, expressed as a fraction
    df = df.sort_values('mass')  # df.sort('mass') in old pandas versions
    df['pdiff'] = (df.mass - df.mass.shift()) / df.mass

       Unnamed: 0                 frag         mass  prot_position         pdiff
    3           3             spwpsmar   417.258700             44           NaN
    2           2                 gtik   417.258734              2  8.148421e-08
    4           4                 lpak   427.279469             29  2.345241e-02
    1           1        epganaigmvafk  1573.675700             29  7.284831e-01
    0           0       tfdehnapnsnsnk  1573.675712              2  7.625459e-09
    5           5  nedsfvvweqiinslsalk  2191.116099             17  2.817926e-01

The first and last data lines make this a little tricky, so the next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but it might need to be tweaked for other cases (though only as far as the first and last lines of data are concerned).

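    # Repeat the last row, then backfill the NaN 'pdiff' of the first row.
    # list(...) is needed on Python 3, where range() is not a list.
    df = df.iloc[list(range(len(df))) + [-1]].bfill()
    # Keep rows that are far from both the previous and the next mass.
    df[(df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm)]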

Results:

       Unnamed: 0                 frag         mass  prot_position     pdiff
    4           4                 lpak   427.279469             29  0.023452
    5           5  nedsfvvweqiinslsalk  2191.116099             17  0.281793
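The same selection can also be sketched without padding the frame, by comparing each sorted mass with both its previous and its next neighbor and treating a missing neighbor as infinitely far away. This is a variant of the approach above rather than the original code. It assumes a recent pandas (where `sort_values` replaces `sort`), and the function name `select_isolated` is hypothetical.

```python
import numpy as np
import pandas as pd


def select_isolated(df, tol=1e-5):
    """Keep rows whose mass is farther than `tol` (relative) from both sorted neighbors."""
    out = df.sort_values("mass")
    mass = out["mass"]
    # Relative gap to the previous row; the first row has no previous neighbor.
    gap_prev = ((mass - mass.shift()) / mass).fillna(np.inf)
    # Relative gap to the next row, i.e. the next row's gap_prev; the last row has none.
    gap_next = gap_prev.shift(-1).fillna(np.inf)
    return out[(gap_prev > tol) & (gap_next > tol)]


df = pd.DataFrame({
    "frag": ["tfdehnapnsnsnk", "epganaigmvafk", "gtik",
             "spwpsmar", "lpak", "nedsfvvweqiinslsalk"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
    "prot_position": [2, 29, 2, 44, 29, 17],
})
result = select_isolated(df)
print(result)
```

Checking only the two sorted neighbors is enough here because, once the masses are sorted, any mass further away in the ordering has an even larger relative gap than the adjacent one.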

Sorry, I don't understand the second part of the question at all.
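One possible reading, offered only as a guess: each row's `prot_position` is the index of its protein's dictionary in `pinfo`, and every selected fragment should add 1 to that dictionary's "1hw_fit" counter. Under that assumption, a minimal sketch with made-up `pinfo` contents and a hypothetical `selected` frame might look like this:

```python
import pandas as pd

# Made-up stand-in for the real list: one dictionary per protein, stored at the
# list index that `prot_position` refers to (an assumption, not confirmed).
pinfo = [{"id": i, "seq": "", "1hw_fit": 0} for i in range(50)]

# Hypothetical result of the mass-based selection.
selected = pd.DataFrame({
    "frag": ["lpak", "nedsfvvweqiinslsalk"],
    "prot_position": [29, 17],
})

# Add 1 to the counter of the protein each selected fragment belongs to.
for pos in selected["prot_position"]:
    pinfo[pos]["1hw_fit"] += 1

print(pinfo[29]["1hw_fit"], pinfo[17]["1hw_fit"])
```

If `prot_position` means something else (for example, an offset within the protein sequence), the lookup would need a different key.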

Edit to add: As mentioned in a comment on @amitavory's answer, I think the sorting approach and the groupby approach could possibly be combined into a simpler answer than this. I might try that at a later time, but anyone should feel free to give it a shot if interested.

