01 Core transformations¶
Initialization¶
To register the custom Pandas accessor `.etl`, it's enough to import `blueetl`:
import blueetl
The Pandas DataFrames `df_col` (columnar) and `df` (MultiIndex) created below will be used to show how the `etl` accessor works in the two cases.
import numpy as np
import pandas as pd
np.random.seed(0)
df_col = pd.DataFrame(
    {
        "simulation_id": [0] * 100 + [1] * 75 + [2] * 25,
        "circuit_id": 0,
        "neuron_class": "L23_EXC",
        "window": "w1",
        "trial": 0,
        "time": np.random.rand(200) + 2,
        "gid": np.random.randint(20, size=200),
    }
)
df_col.loc[df_col["gid"] < 10, "neuron_class"] = "L23_INH"
df = df_col.set_index(["simulation_id", "circuit_id", "neuron_class", "window", "trial"])
display(df)
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_INH | w1 | 0 | 2.548814 | 6 |
|  |  |  |  | 0 | 2.715189 | 7 |
|  |  | L23_EXC | w1 | 0 | 2.602763 | 18 |
|  |  |  |  | 0 | 2.544883 | 11 |
|  |  |  |  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | 0 | L23_INH | w1 | 0 | 2.490459 | 2 |
|  |  | L23_EXC | w1 | 0 | 2.227415 | 13 |
|  |  |  |  | 0 | 2.254356 | 15 |
|  |  | L23_INH | w1 | 0 | 2.058029 | 4 |
|  |  |  |  | 0 | 2.434417 | 9 |
200 rows × 2 columns
Filtering¶
DataFrames and Series can be filtered on both columns and MultiIndex levels using `etl.q`, with the same syntax in both cases.
Multiple keywords can be specified as parameters, and all the conditions must match in the resulting DataFrame or Series.
Each parameter can be:
- a scalar value, matched for equality
- a list, matching any of the contained scalar values
- a dict, for more complex expressions; the supported operators are `eq`, `ne`, `le`, `lt`, `ge`, `gt`, `isin`, and `regex`
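As an illustration (not blueetl's actual implementation), each operator in such a dict corresponds to a Pandas comparison, and the per-operator masks are combined into a single boolean mask. A minimal sketch in plain Pandas, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({"gid": [5, 9, 11, 14]})

# Hypothetical illustration: a filter such as gid={"ge": 9, "lt": 12}
# corresponds to AND-ing one boolean mask per operator:
mask = (df["gid"] >= 9) & (df["gid"] < 12)
result = df[mask]  # keeps only the rows with gid 9 and 11
```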
Select by equality¶
If filtering by equality over the index or the columns, the query can be expressed simply as:
df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
|  |  |  |  | 0 | 2.581273 | 10 |
|  |  |  |  | 0 | 2.725254 | 11 |
|  |  |  |  | 0 | 2.290078 | 10 |
|  |  |  |  | 0 | 2.135474 | 11 |
|  |  |  |  | 0 | 2.011714 | 11 |
|  |  |  |  | 0 | 2.199997 | 11 |
The columnar DataFrame `df_col` can be filtered using the same parameters:
df_col.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
|  | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|
| 110 | 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
| 117 | 1 | 0 | L23_EXC | w1 | 0 | 2.581273 | 10 |
| 120 | 1 | 0 | L23_EXC | w1 | 0 | 2.725254 | 11 |
| 129 | 1 | 0 | L23_EXC | w1 | 0 | 2.290078 | 10 |
| 132 | 1 | 0 | L23_EXC | w1 | 0 | 2.135474 | 11 |
| 166 | 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
| 172 | 1 | 0 | L23_EXC | w1 | 0 | 2.199997 | 11 |
Select by range¶
It's possible to get the same result using the operators `ge` and `le` to filter on `gid`:
df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid={"ge": 9, "le": 11})
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
|  |  |  |  | 0 | 2.581273 | 10 |
|  |  |  |  | 0 | 2.725254 | 11 |
|  |  |  |  | 0 | 2.290078 | 10 |
|  |  |  |  | 0 | 2.135474 | 11 |
|  |  |  |  | 0 | 2.011714 | 11 |
|  |  |  |  | 0 | 2.199997 | 11 |
Select by regular expression¶
In general, selection based on regular expressions is less performant than the other types of filters, so it should be used as a last resort.
When needed, it can be applied to columns containing string objects.
Internally, the comparison is handled by `re.search` from the standard Python library, so any substring can match the expression.
See the official documentation for more information, but remember that you can:
- use the prefix `^` to match from the beginning of the string
- use the suffix `$` to match up to the end of the string
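The anchoring behaviour of `re.search` can be verified directly with the standard library:

```python
import re

# re.search matches anywhere in the string unless anchored:
assert re.search("L2", "L23_EXC") is not None         # substring match
assert re.search("^L23_EXC$", "L23_EXC") is not None  # anchored full match
assert re.search("^23", "L23_EXC") is None            # "^" forces a match at the start
```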
df.etl.q(simulation_id=1, neuron_class={"regex": "^L2"}, gid=[2, 15])
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.881103 | 15 |
|  |  |  |  | 0 | 2.643990 | 15 |
|  |  | L23_INH | w1 | 0 | 2.618015 | 2 |
|  |  |  |  | 0 | 2.703889 | 2 |
Select by variable keys¶
If the keys used to filter are variable, it’s possible to pass a dict instead of named parameters:
for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q({key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_INH | w1 | 0 | 2.677817 | 7 |
|  |  |  |  | 0 | 2.270008 | 9 |
|  |  | L23_EXC | w1 | 0 | 2.735194 | 17 |
|  |  | L23_INH | w1 | 0 | 2.962189 | 9 |
|  |  |  |  | 0 | 2.248753 | 0 |
### Filtered by simulation_id=1
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_EXC | w1 | 0 | 2.832620 | 10 |
|  |  |  |  | 0 | 2.568434 | 10 |
|  |  |  |  | 0 | 2.438602 | 10 |
|  |  |  |  | 0 | 2.368725 | 10 |
|  |  |  |  | 0 | 2.976459 | 10 |
### Filtered by gid=10
Alternatively, it's possible to use the standard Python dict unpacking syntax with `**`:
for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q(**{key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_INH | w1 | 0 | 2.677817 | 7 |
|  |  |  |  | 0 | 2.270008 | 9 |
|  |  | L23_EXC | w1 | 0 | 2.735194 | 17 |
|  |  | L23_INH | w1 | 0 | 2.962189 | 9 |
|  |  |  |  | 0 | 2.248753 | 0 |
### Filtered by simulation_id=1
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_EXC | w1 | 0 | 2.832620 | 10 |
|  |  |  |  | 0 | 2.568434 | 10 |
|  |  |  |  | 0 | 2.438602 | 10 |
|  |  |  |  | 0 | 2.368725 | 10 |
|  |  |  |  | 0 | 2.976459 | 10 |
### Filtered by gid=10
Union of selections¶
Besides passing a dict or named parameters to `etl.q`, it's possible to pass a list of dictionaries.
In this way, the method returns the union of the rows selected by each dictionary.
For performance reasons, a list of dictionaries should be passed only when the filter cannot be expressed with a single dictionary.
For example:
df_col.etl.q(
    [
        {"simulation_id": 1, "neuron_class": "L23_EXC", "gid": [10]},
        {"simulation_id": 1, "neuron_class": "L23_INH", "gid": [9]},
    ]
)
|  | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|
| 101 | 1 | 0 | L23_INH | w1 | 0 | 2.270008 | 9 |
| 103 | 1 | 0 | L23_INH | w1 | 0 | 2.962189 | 9 |
| 110 | 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
| 117 | 1 | 0 | L23_EXC | w1 | 0 | 2.581273 | 10 |
| 122 | 1 | 0 | L23_INH | w1 | 0 | 2.956084 | 9 |
| 129 | 1 | 0 | L23_EXC | w1 | 0 | 2.290078 | 10 |
| 152 | 1 | 0 | L23_INH | w1 | 0 | 2.162493 | 9 |
| 167 | 1 | 0 | L23_INH | w1 | 0 | 2.359978 | 9 |
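Conceptually, a list of filter dictionaries corresponds to OR-ing the boolean masks produced by each dictionary. A plain-Pandas sketch of the same idea, using a small made-up DataFrame rather than blueetl's internals:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "simulation_id": [0, 1, 1],
        "neuron_class": ["L23_EXC", "L23_EXC", "L23_INH"],
        "gid": [10, 10, 9],
    }
)

# One mask per dictionary, combined with OR:
mask_a = (df["simulation_id"] == 1) & (df["neuron_class"] == "L23_EXC") & df["gid"].isin([10])
mask_b = (df["simulation_id"] == 1) & (df["neuron_class"] == "L23_INH") & df["gid"].isin([9])
result = df[mask_a | mask_b]  # union of the two selections
```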
Comparison with plain Pandas¶
Let’s consider the following query:
df.etl.q(
    simulation_id=1,
    window=["w1", "w2"],
    gid={"ge": 8, "lt": 12},
    time={"lt": 2.1},
)
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
The same result can be obtained with plain Pandas, but it's more verbose and it's harder to chain the results, as shown in the example below.
df.loc[
    (df.index.get_level_values('simulation_id') == 1)
    & (df.index.get_level_values('window').isin(["w1", "w2"]))
    & (df['gid'] >= 8)
    & (df['gid'] < 12)
    & (df['time'] < 2.1)
]
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
As an alternative, it's possible to use the Pandas `query()` method as shown below, although this operation can be slower than the previous methods.
df.query("simulation_id==1 and window==['w1', 'w2'] and gid >= 8 and gid < 12 and time < 2.1")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
When a query cannot be built with `etl.q`, it's still possible to use the plain Pandas syntax, since the returned objects are ordinary DataFrames or Series.
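For instance, a condition relating two columns to each other has no per-key equivalent, but can be expressed with a plain boolean mask. A sketch with a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"gid": [5, 9, 11, 14], "time": [2.1, 2.5, 2.0, 2.7]})

# A cross-column condition, not expressible as independent per-key filters:
result = df[df["time"] > df["gid"] / 5]
```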
Working with MultiIndexes¶
Retrieve the names of all the MultiIndex levels (conditions):
df.etl.conditions()
FrozenList(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'trial'])
Retrieve the difference between all the conditions and the specified conditions:
df.etl.complementary_conditions(['trial', 'simulation_id'])
FrozenList(['circuit_id', 'neuron_class', 'window'])
Retrieve the unique labels for each level:
df.etl.labels()
[Index([0, 1, 2], dtype='int64', name='simulation_id'),
Index([0], dtype='int64', name='circuit_id'),
Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class'),
Index(['w1'], dtype='object', name='window'),
Index([0], dtype='int64', name='trial')]
or for specific levels:
df.etl.labels(["neuron_class"])
[Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class')]
It's possible to remove one or more conditions. However, note that the resulting MultiIndex may no longer be unique.
df.etl.remove_conditions(['window', 'trial'])
| simulation_id | circuit_id | neuron_class | time | gid |
|---|---|---|---|---|
| 0 | 0 | L23_INH | 2.548814 | 6 |
|  |  | L23_INH | 2.715189 | 7 |
|  |  | L23_EXC | 2.602763 | 18 |
|  |  | L23_EXC | 2.544883 | 11 |
|  |  | L23_EXC | 2.423655 | 19 |
| ... | ... | ... | ... | ... |
| 2 | 0 | L23_INH | 2.490459 | 2 |
|  |  | L23_EXC | 2.227415 | 13 |
|  |  | L23_EXC | 2.254356 | 15 |
|  |  | L23_INH | 2.058029 | 4 |
|  |  | L23_INH | 2.434417 | 9 |
200 rows × 2 columns
It’s also possible to specify only the conditions to keep:
df.etl.keep_conditions(['window', 'trial'])
| window | trial | time | gid |
|---|---|---|---|
| w1 | 0 | 2.548814 | 6 |
|  | 0 | 2.715189 | 7 |
|  | 0 | 2.602763 | 18 |
|  | 0 | 2.544883 | 11 |
|  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... |
|  | 0 | 2.490459 | 2 |
|  | 0 | 2.227415 | 13 |
|  | 0 | 2.254356 | 15 |
|  | 0 | 2.058029 | 4 |
|  | 0 | 2.434417 | 9 |
200 rows × 2 columns
New levels can be added in the following way:
df.etl.add_conditions(conditions=['cond1', 'cond2'], values=[111, 222])
| cond1 | cond2 | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|---|
| 111 | 222 | 0 | 0 | L23_INH | w1 | 0 | 2.548814 | 6 |
|  |  |  |  |  |  | 0 | 2.715189 | 7 |
|  |  |  |  | L23_EXC | w1 | 0 | 2.602763 | 18 |
|  |  |  |  |  |  | 0 | 2.544883 | 11 |
|  |  |  |  |  |  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
|  |  | 2 | 0 | L23_INH | w1 | 0 | 2.490459 | 2 |
|  |  |  |  | L23_EXC | w1 | 0 | 2.227415 | 13 |
|  |  |  |  |  |  | 0 | 2.254356 | 15 |
|  |  |  |  | L23_INH | w1 | 0 | 2.058029 | 4 |
|  |  |  |  |  |  | 0 | 2.434417 | 9 |
200 rows × 2 columns
Iteration¶
In general, iterations can be slow and should be avoided in favour of vectorized operations.
Iteration over DataFrames¶
The method `etl.iter()` can be used as an alternative to Pandas `itertuples()` when a namedtuple is desired for the index as well:
for index, values in df.etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
Values(time=2.5488135039273248, gid=6)
In some cases, it may be more convenient to work with dictionaries, accepting some performance penalty:
for index, values in df.etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
{'time': 2.5488135039273248, 'gid': 6}
For comparison, this is the result with Pandas `itertuples()`:
for item in df.itertuples():
    print(item)
    break
Pandas(Index=(0, 0, 'L23_INH', 'w1', 0), time=2.5488135039273248, gid=6)
The Pandas `iterrows()` method should be avoided because it can be much slower than the other methods (note also that it coerces each yielded row to a single dtype, as shown by the float `gid` below):
for index, values in df.iterrows():
    print(index)
    print(values)
    break
(0, 0, 'L23_INH', 'w1', 0)
time 2.548814
gid 6.000000
Name: (0, 0, L23_INH, w1, 0), dtype: float64
Iteration over Series¶
The `etl.iter()` method can also be used to iterate over Series; in this case, the yielded value is just the value of the item:
for index, values in df['time'].etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
2.5488135039273248
Similarly, when used with Series, `etl.iterdict()` yields a dict for the index and the plain value for the item:
for index, values in df['time'].etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
2.5488135039273248
Grouping¶
In some cases it can be useful to use `etl.groupby_except()` to group by all the conditions except the specified ones:
df.etl.groupby_except(['neuron_class', 'window', 'trial']).count()
| simulation_id | circuit_id | time | gid |
|---|---|---|---|
| 0 | 0 | 100 | 100 |
| 1 | 0 | 75 | 75 |
| 2 | 0 | 25 | 25 |
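In plain Pandas, the same kind of grouping can be mimicked by computing the complementary index levels manually; a minimal sketch with a small made-up DataFrame (not blueetl's implementation):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 0, 1], "b": [0, 1, 0], "x": [1.0, 2.0, 3.0]}
).set_index(["a", "b"])

excluded = ["b"]
# Group by every index level except the excluded ones:
keep = [name for name in df.index.names if name not in excluded]
result = df.groupby(keep).count()
```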
It's then possible to apply a user-defined function to the result:
from scipy.stats import entropy

def response_entropy(x):
    return entropy(x, base=2) / np.log2(len(x))
df['time'].etl.groupby_except(['neuron_class', 'window', 'trial']).apply(response_entropy)
simulation_id circuit_id
0 0 0.998522
1 0 0.998623
2 0 0.998177
Name: time, dtype: float64
The method `etl.groupby_iter()` can be used to get a generator yielding each record of the grouped DataFrame as a tuple `(key, df)`.
In general, `etl.groupby_iter()` is not a drop-in replacement for `groupby()`, because it returns a generator, while `groupby()` returns a `DataFrameGroupBy`.
However, it can be used as a replacement for iterating over `groupby()`, with these differences:
- the yielded keys are namedtuples instead of tuples
- it's possible to specify the list of column names to be included in the yielded DataFrames
Iterating over `df`:
for key, tmp_df in df.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
time gid
simulation_id circuit_id neuron_class window trial
0 0 L23_EXC w1 0 2.832620 10
0 2.568434 10
0 2.438602 10
0 2.368725 10
0 2.976459 10
For comparison, with Pandas:
for key, tmp_df in df.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
time gid
simulation_id circuit_id neuron_class window trial
0 0 L23_EXC w1 0 2.832620 10
0 2.568434 10
0 2.438602 10
0 2.368725 10
0 2.976459 10
Iterating over `df_col` (columnar representation):
for key, tmp_df in df_col.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
simulation_id circuit_id neuron_class window trial time gid
17 0 0 L23_EXC w1 0 2.832620 10
33 0 0 L23_EXC w1 0 2.568434 10
51 0 0 L23_EXC w1 0 2.438602 10
65 0 0 L23_EXC w1 0 2.368725 10
70 0 0 L23_EXC w1 0 2.976459 10
For comparison, with Pandas:
for key, tmp_df in df_col.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
simulation_id circuit_id neuron_class window trial time gid
17 0 0 L23_EXC w1 0 2.832620 10
33 0 0 L23_EXC w1 0 2.568434 10
51 0 0 L23_EXC w1 0 2.438602 10
65 0 0 L23_EXC w1 0 2.368725 10
70 0 0 L23_EXC w1 0 2.976459 10
Multiprocessing¶
With `etl.groupby_run_parallel()` it's possible to group the DataFrame and run a function in parallel processes.
The function should accept `(key, df)` and can return anything.
This method should be used only when the function is expensive enough to be worth executing in a separate subprocess. If the function isn't expensive, iterating over `etl.groupby_iter()` and calling the function directly is probably faster.
In the following example, the function is executed in separate processes and simply returns the grouping key and the length of the partial DataFrame.
def func(key, df):
    return key, len(df)
df.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]
The same result is returned when `etl.groupby_run_parallel()` is applied to the columnar `df_col`:
def func(key, df):
    return key, len(df)
df_col.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]
If the function returns a DataFrame and all the results should be concatenated into a single DataFrame, it's possible to use `etl.groupby_apply_parallel()`.
This is just a shortcut for calling `etl.groupby_run_parallel()` and concatenating the results with `pd.concat()`.
Note that this method should still be considered experimental and subject to change.
def func(key, df):
    return pd.DataFrame(
        data={"mean": [df['time'].mean()]},
        index=pd.MultiIndex.from_tuples([key], names=key._fields),
    )
df_col.etl.groupby_apply_parallel(['simulation_id', 'circuit_id', 'window'], func=func)
| simulation_id | circuit_id | window | mean |
|---|---|---|---|
| 0 | 0 | w1 | 2.472794 |
| 1 | 0 | w1 | 2.550862 |
| 2 | 0 | w1 | 2.459742 |
In simple cases like the previous example, it may be more efficient to avoid multiprocessing altogether:
df_col.groupby(['simulation_id', 'circuit_id', 'window'])[['time']].mean()
| simulation_id | circuit_id | window | time |
|---|---|---|---|
| 0 | 0 | w1 | 2.472794 |
| 1 | 0 | w1 | 2.550862 |
| 2 | 0 | w1 | 2.459742 |