01 Core transformations

Initialization

To register the custom Pandas accessor .etl, it’s enough to import blueetl:

import blueetl

The Pandas DataFrames df_col (columnar) and df (with a MultiIndex), created below, will be used to show how the etl accessor works in both cases.

import numpy as np
import pandas as pd

np.random.seed(0)
df_col = pd.DataFrame(
    {
        "simulation_id": [0] * 100 + [1] * 75 + [2] * 25,
        "circuit_id": 0,
        "neuron_class": "L23_EXC",
        "window": "w1",
        "trial": 0,
        "time": np.random.rand(200) + 2,
        "gid": np.random.randint(20, size=200),
    }
)
df_col.loc[df_col["gid"] < 10, "neuron_class"] = "L23_INH"

df = df_col.set_index(["simulation_id", "circuit_id", "neuron_class", "window", "trial"])
display(df)
                                                        time  gid
simulation_id circuit_id neuron_class window trial
0             0          L23_INH      w1     0      2.548814    6
                                             0      2.715189    7
                         L23_EXC      w1     0      2.602763   18
                                             0      2.544883   11
                                             0      2.423655   19
...                                                      ...  ...
2             0          L23_INH      w1     0      2.490459    2
                         L23_EXC      w1     0      2.227415   13
                                             0      2.254356   15
                         L23_INH      w1     0      2.058029    4
                                             0      2.434417    9

200 rows × 2 columns

Filtering

With etl.q, DataFrames and Series can be filtered on both columns and levels of the MultiIndex, using the same syntax in both cases.

Multiple keywords can be specified as parameters; all the conditions must be satisfied by the rows of the resulting dataframe or series (logical AND).

Each parameter can be:

  • a scalar value matched for equality

  • a list to match any contained scalar value

  • a dict for more complex expressions (see the sketch after this list). The supported operators are: eq, ne, le, lt, ge, gt, isin, regex
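
For example, a minimal sketch (using only operators from the list above) selecting the rows where simulation_id is different from 0 and time is strictly less than 2.5:

# "ne" negates equality, "lt" is a strict less-than
df.etl.q(simulation_id={"ne": 0}, time={"lt": 2.5})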

Select by equality

When filtering by equality over the index levels or the columns, the query can be expressed simply as:

df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.447125   10
                                             0      2.581273   10
                                             0      2.725254   11
                                             0      2.290078   10
                                             0      2.135474   11
                                             0      2.011714   11
                                             0      2.199997   11

The columnar dataframe df_col can be filtered using the same parameters:

df_col.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
     simulation_id  circuit_id neuron_class window  trial      time  gid
110              1           0      L23_EXC     w1      0  2.447125   10
117              1           0      L23_EXC     w1      0  2.581273   10
120              1           0      L23_EXC     w1      0  2.725254   11
129              1           0      L23_EXC     w1      0  2.290078   10
132              1           0      L23_EXC     w1      0  2.135474   11
166              1           0      L23_EXC     w1      0  2.011714   11
172              1           0      L23_EXC     w1      0  2.199997   11

Select by range

It’s possible to get the same result using the operators ge and le to filter on gid in the following way:

df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid={"ge": 9, "le": 11})
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.447125   10
                                             0      2.581273   10
                                             0      2.725254   11
                                             0      2.290078   10
                                             0      2.135474   11
                                             0      2.011714   11
                                             0      2.199997   11

Select by regular expression

In general, the selection based on regular expressions is less performant than the other types of filters, so it should be used as a last resort.

When needed, it can be applied to columns or index levels containing strings.

Internally, the comparison is handled by re.search from the standard Python library, so the pattern can match anywhere in the string, not only at the beginning.

See the official documentation for more information, but remember that you can:

  • use the prefix ^ to match from the beginning of the string

  • use the suffix $ to match up to the end of the string
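
For example, the unanchored pattern "EXC" would match "L23_EXC", because re.search can find the match anywhere in the string (a minimal sketch, equivalent in this dataset to neuron_class="L23_EXC"):

# unanchored pattern: matches every neuron_class containing "EXC"
df.etl.q(neuron_class={"regex": "EXC"})

The following example combines an anchored pattern with other filters: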

df.etl.q(simulation_id=1, neuron_class={"regex": "^L2"}, gid=[2, 15])
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.881103   15
                                             0      2.643990   15
                         L23_INH      w1     0      2.618015    2
                                             0      2.703889    2

Select by variable keys

If the keys used to filter are variable, it’s possible to pass a dict instead of named parameters:

for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q({key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_INH      w1     0      2.677817    7
                                             0      2.270008    9
                         L23_EXC      w1     0      2.735194   17
                         L23_INH      w1     0      2.962189    9
                                             0      2.248753    0
### Filtered by simulation_id=1
                                                        time  gid
simulation_id circuit_id neuron_class window trial
0             0          L23_EXC      w1     0      2.832620   10
                                             0      2.568434   10
                                             0      2.438602   10
                                             0      2.368725   10
                                             0      2.976459   10
### Filtered by gid=10

Alternatively, it’s possible to use the standard Python dict unpacking syntax with **:

for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q(**{key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_INH      w1     0      2.677817    7
                                             0      2.270008    9
                         L23_EXC      w1     0      2.735194   17
                         L23_INH      w1     0      2.962189    9
                                             0      2.248753    0
### Filtered by simulation_id=1
                                                        time  gid
simulation_id circuit_id neuron_class window trial
0             0          L23_EXC      w1     0      2.832620   10
                                             0      2.568434   10
                                             0      2.438602   10
                                             0      2.368725   10
                                             0      2.976459   10
### Filtered by gid=10

Union of selections

Besides passing a dict or named parameters to etl.q, it’s possible to pass a list of dictionaries.

In this way, the method will return the union of the rows selected by each dictionary.

For performance reasons, you should pass a list of dictionaries only when the filter cannot be expressed with a single dictionary.

For example:

df_col.etl.q(
    [
        {"simulation_id": 1, "neuron_class": "L23_EXC", "gid": [10]},
        {"simulation_id": 1, "neuron_class": "L23_INH", "gid": [9]},
    ]
)
     simulation_id  circuit_id neuron_class window  trial      time  gid
101              1           0      L23_INH     w1      0  2.270008    9
103              1           0      L23_INH     w1      0  2.962189    9
110              1           0      L23_EXC     w1      0  2.447125   10
117              1           0      L23_EXC     w1      0  2.581273   10
122              1           0      L23_INH     w1      0  2.956084    9
129              1           0      L23_EXC     w1      0  2.290078   10
152              1           0      L23_INH     w1      0  2.162493    9
167              1           0      L23_INH     w1      0  2.359978    9

Comparison with plain Pandas

Let’s consider the following query:

df.etl.q(
    simulation_id=1,
    window=["w1", "w2"],
    gid={"ge": 8, "lt": 12},
    time={"lt": 2.1},
)
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.011714   11
                         L23_INH      w1     0      2.054338    8

The same result can be obtained with plain Pandas, but the expression is more verbose and harder to chain, as shown in the example below.

df.loc[
    (df.index.get_level_values('simulation_id') == 1)
    & (df.index.get_level_values('window').isin(["w1", "w2"]))
    & (df['gid'] >= 8)
    & (df['gid'] < 12)
    & (df['time'] < 2.1)
]
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.011714   11
                         L23_INH      w1     0      2.054338    8

As an alternative, it’s possible to use the Pandas query() method as shown below. However, this operation can be slower than the previous methods.

df.query("simulation_id==1 and window==['w1', 'w2'] and gid >= 8 and gid < 12 and time < 2.1")
                                                        time  gid
simulation_id circuit_id neuron_class window trial
1             0          L23_EXC      w1     0      2.011714   11
                         L23_INH      w1     0      2.054338    8

When a query cannot be built with etl.q, it’s still possible to use the plain Pandas syntax, since the returned objects are DataFrames or Series.
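
For example, a condition on a derived value cannot be expressed with the etl.q operators, but it can be applied as a plain boolean mask on the filtered result (a minimal sketch):

# first narrow down the rows with etl.q...
filtered = df.etl.q(simulation_id=1, neuron_class="L23_EXC")
# ...then chain a plain Pandas mask on a derived value
filtered[filtered["time"] % 1 < 0.1]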

Working with MultiIndexes

Retrieve the names of all the MultiIndex levels (conditions):

df.etl.conditions()
FrozenList(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'trial'])

Retrieve the difference between all the conditions and the specified conditions:

df.etl.complementary_conditions(['trial', 'simulation_id'])
FrozenList(['circuit_id', 'neuron_class', 'window'])

Retrieve the unique labels for each level:

df.etl.labels()
[Index([0, 1, 2], dtype='int64', name='simulation_id'),
 Index([0], dtype='int64', name='circuit_id'),
 Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class'),
 Index(['w1'], dtype='object', name='window'),
 Index([0], dtype='int64', name='trial')]

or for specific levels:

df.etl.labels(["neuron_class"])
[Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class')]

It’s possible to remove one or more conditions. However, note that the resulting MultiIndex may no longer be unique.

df.etl.remove_conditions(['window', 'trial'])
                                          time  gid
simulation_id circuit_id neuron_class
0             0          L23_INH      2.548814    6
                         L23_INH      2.715189    7
                         L23_EXC      2.602763   18
                         L23_EXC      2.544883   11
                         L23_EXC      2.423655   19
...                                        ...  ...
2             0          L23_INH      2.490459    2
                         L23_EXC      2.227415   13
                         L23_EXC      2.254356   15
                         L23_INH      2.058029    4
                         L23_INH      2.434417    9

200 rows × 2 columns

It’s also possible to specify only the conditions to keep:

df.etl.keep_conditions(['window', 'trial'])
                  time  gid
window trial
w1     0      2.548814    6
       0      2.715189    7
       0      2.602763   18
       0      2.544883   11
       0      2.423655   19
...                ...  ...
       0      2.490459    2
       0      2.227415   13
       0      2.254356   15
       0      2.058029    4
       0      2.434417    9

200 rows × 2 columns

New levels can be added in the following way:

df.etl.add_conditions(conditions=['cond1', 'cond2'], values=[111, 222])
                                                                        time  gid
cond1 cond2 simulation_id circuit_id neuron_class window trial
111   222   0             0          L23_INH      w1     0          2.548814    6
                                                         0          2.715189    7
                                     L23_EXC      w1     0          2.602763   18
                                                         0          2.544883   11
                                                         0          2.423655   19
...                                                                      ...  ...
            2             0          L23_INH      w1     0          2.490459    2
                                     L23_EXC      w1     0          2.227415   13
                                                         0          2.254356   15
                                     L23_INH      w1     0          2.058029    4
                                                         0          2.434417    9

200 rows × 2 columns

Iteration

In general, iterating over rows is slow and should be avoided in favour of vectorized operations.

Iteration over DataFrames

The method etl.iter() can be used as an alternative to Pandas itertuples() when a namedtuple is desired for the index as well:

for index, values in df.etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
Values(time=2.5488135039273248, gid=6)

In some cases, it may be more convenient to work with dictionaries, accepting some performance penalty:

for index, values in df.etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
{'time': 2.5488135039273248, 'gid': 6}

For comparison, this is the result with Pandas itertuples():

for item in df.itertuples():
    print(item)
    break
Pandas(Index=(0, 0, 'L23_INH', 'w1', 0), time=2.5488135039273248, gid=6)

The Pandas iterrows() method should be avoided because it can be much slower than the other methods and, as shown below, it does not preserve the dtypes across rows (gid is converted to float):

for index, values in df.iterrows():
    print(index)
    print(values)
    break
(0, 0, 'L23_INH', 'w1', 0)
time    2.548814
gid     6.000000
Name: (0, 0, L23_INH, w1, 0), dtype: float64

Iteration over Series

The etl.iter() method can be used also to iterate over Series, and in this case the yielded value is the scalar value of the item:

for index, values in df['time'].etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
2.5488135039273248

Similarly, when used with Series, etl.iterdict() will yield a dict for the index, and the plain value for the item:

for index, values in df['time'].etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
2.5488135039273248

Grouping

In some cases it can be useful to use etl.groupby_except() to group by all the conditions except some of them:

df.etl.groupby_except(['neuron_class', 'window', 'trial']).count()
                          time  gid
simulation_id circuit_id
0             0            100  100
1             0             75   75
2             0             25   25

It’s then possible to apply a user-defined function to the result:

from scipy.stats import entropy

def response_entropy(x):
    # Shannon entropy (base 2) of the distribution of x,
    # normalized by the maximum possible entropy log2(len(x))
    return entropy(x, base=2) / np.log2(len(x))

df['time'].etl.groupby_except(['neuron_class', 'window', 'trial']).apply(response_entropy)
simulation_id  circuit_id
0              0             0.998522
1              0             0.998623
2              0             0.998177
Name: time, dtype: float64
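
For comparison, since the complementary conditions here are simulation_id and circuit_id, the same grouping can be expressed with plain Pandas (a minimal sketch):

# equivalent plain Pandas grouping over the complementary conditions
df['time'].groupby(level=['simulation_id', 'circuit_id']).apply(response_entropy)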

The method etl.groupby_iter() can be used to get a generator yielding each record of the grouped dataframe as a tuple (key, df).

In general, etl.groupby_iter() is not a replacement for groupby() because it returns a generator, while groupby() returns a DataFrameGroupBy.

However, it can be used as a replacement for the iteration over groupby(), with these differences:

  • the yielded keys are namedtuples, instead of tuples

  • it’s possible to specify the list of column names to be included in the yielded dataframes (see the sketch after this list)
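
For instance, a sketch restricting the yielded dataframes to the time column; the parameter name selected_columns is an assumption, not shown elsewhere in this notebook, so check the blueetl documentation for your version:

# keep only the "time" column in each yielded dataframe
# (the parameter name selected_columns is assumed)
for key, tmp_df in df.etl.groupby_iter(['simulation_id', 'circuit_id'], selected_columns=['time']):
    print(key)
    print(tmp_df)
    break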

Iterating over df:

for key, tmp_df in df.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
                                                        time  gid
simulation_id circuit_id neuron_class window trial               
0             0          L23_EXC      w1     0      2.832620   10
                                             0      2.568434   10
                                             0      2.438602   10
                                             0      2.368725   10
                                             0      2.976459   10

For comparison, with Pandas:

for key, tmp_df in df.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
                                                        time  gid
simulation_id circuit_id neuron_class window trial               
0             0          L23_EXC      w1     0      2.832620   10
                                             0      2.568434   10
                                             0      2.438602   10
                                             0      2.368725   10
                                             0      2.976459   10

Iterating over df_col (columnar representation):

for key, tmp_df in df_col.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
    simulation_id  circuit_id neuron_class window  trial      time  gid
17              0           0      L23_EXC     w1      0  2.832620   10
33              0           0      L23_EXC     w1      0  2.568434   10
51              0           0      L23_EXC     w1      0  2.438602   10
65              0           0      L23_EXC     w1      0  2.368725   10
70              0           0      L23_EXC     w1      0  2.976459   10

For comparison, with Pandas:

for key, tmp_df in df_col.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
    simulation_id  circuit_id neuron_class window  trial      time  gid
17              0           0      L23_EXC     w1      0  2.832620   10
33              0           0      L23_EXC     w1      0  2.568434   10
51              0           0      L23_EXC     w1      0  2.438602   10
65              0           0      L23_EXC     w1      0  2.368725   10
70              0           0      L23_EXC     w1      0  2.976459   10

Multiprocessing

With etl.groupby_run_parallel() it’s possible to group the dataframe and run a function in parallel processes. The function should accept (key, df) and can return anything.

This method should be used only when the function is expensive enough to be worth executing in a separate subprocess. If the function isn’t expensive, iterating over etl.groupby_iter() and calling the function directly is probably faster, as in the sketch below.
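
A minimal sketch of that serial alternative, where expensive_func is a placeholder for any function accepting (key, df):

# serial version: no subprocess overhead, usually faster for cheap functions
results = [
    expensive_func(key, group_df)
    for key, group_df in df.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class'])
]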

In the following example, the defined function is executed in separate processes and it just returns the grouping key and the length of the partial dataframe.

def func(key, df):
    return key, len(df)

df.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
 (RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
 (RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
 (RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
 (RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
 (RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]

The same result is returned when etl.groupby_run_parallel() is applied to the columnar df_col:

def func(key, df):
    return key, len(df)

df_col.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
 (RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
 (RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
 (RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
 (RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
 (RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]

If the function returns a DataFrame and all the DataFrames should be concatenated together to create a single DataFrame, it’s possible to use etl.groupby_apply_parallel().

This is just a shortcut to calling etl.groupby_run_parallel() and concatenating the results with pd.concat().

Note that this method should still be considered experimental and subject to change.

def func(key, df):
    # return a one-row DataFrame indexed by the grouping key
    return pd.DataFrame(
        data={"mean": [df['time'].mean()]},
        index=pd.MultiIndex.from_tuples([key], names=key._fields)
    )

df_col.etl.groupby_apply_parallel(['simulation_id', 'circuit_id', 'window'], func=func)
                                      mean
simulation_id circuit_id window
0             0          w1       2.472794
1             0          w1       2.550862
2             0          w1       2.459742
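
Based on the description above, the previous call is roughly equivalent to the following sketch (not the actual implementation):

# run the function in parallel, then concatenate the partial results
results = df_col.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'window'], func=func)
pd.concat(results)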

In simple cases like the previous example, it may be more efficient to avoid multiprocessing altogether:

df_col.groupby(['simulation_id', 'circuit_id', 'window'])[['time']].mean()
                                      time
simulation_id circuit_id window
0             0          w1       2.472794
1             0          w1       2.550862
2             0          w1       2.459742