01 Core transformations¶
Initialization¶
To register the custom Pandas accessor `.etl`, it's enough to import `blueetl`:
import blueetl
The Pandas DataFrames `df_col` (columnar) and `df` (MultiIndex) created below will be used to show how the `etl` accessor works in the two cases.
import numpy as np
import pandas as pd
np.random.seed(0)
df_col = pd.DataFrame(
    {
        "simulation_id": [0] * 100 + [1] * 75 + [2] * 25,
        "circuit_id": 0,
        "neuron_class": "L23_EXC",
        "window": "w1",
        "trial": 0,
        "time": np.random.rand(200) + 2,
        "gid": np.random.randint(20, size=200),
    }
)
df_col.loc[df_col["gid"] < 10, "neuron_class"] = "L23_INH"
df = df_col.set_index(["simulation_id", "circuit_id", "neuron_class", "window", "trial"])
display(df)
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_INH | w1 | 0 | 2.548814 | 6 |
|  |  |  |  | 0 | 2.715189 | 7 |
|  |  | L23_EXC | w1 | 0 | 2.602763 | 18 |
|  |  |  |  | 0 | 2.544883 | 11 |
|  |  |  |  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | 0 | L23_INH | w1 | 0 | 2.490459 | 2 |
|  |  | L23_EXC | w1 | 0 | 2.227415 | 13 |
|  |  |  |  | 0 | 2.254356 | 15 |
|  |  | L23_INH | w1 | 0 | 2.058029 | 4 |
|  |  |  |  | 0 | 2.434417 | 9 |
200 rows × 2 columns
Filtering¶
DataFrames and Series can be filtered on both columns and MultiIndex levels using `etl.q`, with the same syntax in both cases.
Multiple keywords can be specified as parameters, and all the conditions must match in the resulting DataFrame or Series.
Each parameter can be:
- a scalar value, matched for equality
- a list, matching any of the contained scalar values
- a dict, for more complex expressions; the supported operators are `eq`, `ne`, `le`, `lt`, `ge`, `gt`, `isin`, and `regex`
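As an illustration (not blueetl's actual implementation), each operator in such a dict corresponds to a Pandas comparison, and the per-operator masks are combined into a single boolean mask. A minimal sketch in plain Pandas, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({"gid": [5, 9, 11, 14]})

# Hypothetical illustration: a filter such as gid={"ge": 9, "lt": 12}
# corresponds to AND-ing one boolean mask per operator:
mask = (df["gid"] >= 9) & (df["gid"] < 12)
result = df[mask]  # keeps only the rows with gid 9 and 11
```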
Select by equality¶
If filtering by equality over the index or the columns, the query can be expressed simply as:
df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
|  |  |  |  | 0 | 2.581273 | 10 |
|  |  |  |  | 0 | 2.725254 | 11 |
|  |  |  |  | 0 | 2.290078 | 10 |
|  |  |  |  | 0 | 2.135474 | 11 |
|  |  |  |  | 0 | 2.011714 | 11 |
|  |  |  |  | 0 | 2.199997 | 11 |
The columnar DataFrame `df_col` can be filtered using the same parameters:
df_col.etl.q(simulation_id=1, neuron_class="L23_EXC", gid=[9, 10, 11])
|  | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|
| 110 | 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
| 117 | 1 | 0 | L23_EXC | w1 | 0 | 2.581273 | 10 |
| 120 | 1 | 0 | L23_EXC | w1 | 0 | 2.725254 | 11 |
| 129 | 1 | 0 | L23_EXC | w1 | 0 | 2.290078 | 10 |
| 132 | 1 | 0 | L23_EXC | w1 | 0 | 2.135474 | 11 |
| 166 | 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
| 172 | 1 | 0 | L23_EXC | w1 | 0 | 2.199997 | 11 |
Select by range¶
It's possible to get the same result using the operators `ge` and `le` to filter on `gid`:
df.etl.q(simulation_id=1, neuron_class="L23_EXC", gid={"ge": 9, "le": 11})
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
|  |  |  |  | 0 | 2.581273 | 10 |
|  |  |  |  | 0 | 2.725254 | 11 |
|  |  |  |  | 0 | 2.290078 | 10 |
|  |  |  |  | 0 | 2.135474 | 11 |
|  |  |  |  | 0 | 2.011714 | 11 |
|  |  |  |  | 0 | 2.199997 | 11 |
Select by regular expression¶
In general, selection based on regular expressions is less performant than the other types of filters, so it should be used as a last resort.
When needed, it can be applied to columns containing string objects.
Internally, the comparison is handled by `re.search` from the standard Python library, so any substring can match the expression.
See the official documentation for more information, but remember that you can:
- use the prefix `^` to match from the beginning of the string
- use the suffix `$` to match up to the end of the string
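The anchoring behaviour of `re.search` can be verified directly with the standard library:

```python
import re

# re.search matches anywhere in the string unless anchored:
assert re.search("L2", "L23_EXC") is not None         # substring match
assert re.search("^L23_EXC$", "L23_EXC") is not None  # anchored full match
assert re.search("^23", "L23_EXC") is None            # "^" forces a match at the start
```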
df.etl.q(simulation_id=1, neuron_class={"regex": "^L2"}, gid=[2, 15])
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.881103 | 15 |
|  |  |  |  | 0 | 2.643990 | 15 |
|  |  | L23_INH | w1 | 0 | 2.618015 | 2 |
|  |  |  |  | 0 | 2.703889 | 2 |
Select by variable keys¶
If the keys used to filter are variable, it’s possible to pass a dict instead of named parameters:
for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q({key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_INH | w1 | 0 | 2.677817 | 7 |
|  |  |  |  | 0 | 2.270008 | 9 |
|  |  | L23_EXC | w1 | 0 | 2.735194 | 17 |
|  |  | L23_INH | w1 | 0 | 2.962189 | 9 |
|  |  |  |  | 0 | 2.248753 | 0 |
### Filtered by simulation_id=1
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_EXC | w1 | 0 | 2.832620 | 10 |
|  |  |  |  | 0 | 2.568434 | 10 |
|  |  |  |  | 0 | 2.438602 | 10 |
|  |  |  |  | 0 | 2.368725 | 10 |
|  |  |  |  | 0 | 2.976459 | 10 |
### Filtered by gid=10
Alternatively, it's possible to use the standard Python dict unpacking syntax with `**`:
for key, value in [("simulation_id", 1), ("gid", 10)]:
    result = df.etl.q(**{key: value})
    display(result.head())
    print(f"### Filtered by {key}={value}")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_INH | w1 | 0 | 2.677817 | 7 |
|  |  |  |  | 0 | 2.270008 | 9 |
|  |  | L23_EXC | w1 | 0 | 2.735194 | 17 |
|  |  | L23_INH | w1 | 0 | 2.962189 | 9 |
|  |  |  |  | 0 | 2.248753 | 0 |
### Filtered by simulation_id=1
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 0 | 0 | L23_EXC | w1 | 0 | 2.832620 | 10 |
|  |  |  |  | 0 | 2.568434 | 10 |
|  |  |  |  | 0 | 2.438602 | 10 |
|  |  |  |  | 0 | 2.368725 | 10 |
|  |  |  |  | 0 | 2.976459 | 10 |
### Filtered by gid=10
Union of selections¶
Besides passing a dict or named parameters to `etl.q`, it's possible to pass a list of dictionaries.
In this way, the method returns the union of the rows selected by each dictionary.
For performance reasons, a list of dictionaries should be passed only when the filter cannot be expressed with a single dictionary.
For example:
df_col.etl.q(
    [
        {"simulation_id": 1, "neuron_class": "L23_EXC", "gid": [10]},
        {"simulation_id": 1, "neuron_class": "L23_INH", "gid": [9]},
    ]
)
|  | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|
| 101 | 1 | 0 | L23_INH | w1 | 0 | 2.270008 | 9 |
| 103 | 1 | 0 | L23_INH | w1 | 0 | 2.962189 | 9 |
| 110 | 1 | 0 | L23_EXC | w1 | 0 | 2.447125 | 10 |
| 117 | 1 | 0 | L23_EXC | w1 | 0 | 2.581273 | 10 |
| 122 | 1 | 0 | L23_INH | w1 | 0 | 2.956084 | 9 |
| 129 | 1 | 0 | L23_EXC | w1 | 0 | 2.290078 | 10 |
| 152 | 1 | 0 | L23_INH | w1 | 0 | 2.162493 | 9 |
| 167 | 1 | 0 | L23_INH | w1 | 0 | 2.359978 | 9 |
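Conceptually, a list of filter dictionaries corresponds to OR-ing the boolean masks produced by each dictionary. A plain-Pandas sketch of the same idea, using a small made-up DataFrame rather than blueetl's internals:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "simulation_id": [0, 1, 1],
        "neuron_class": ["L23_EXC", "L23_EXC", "L23_INH"],
        "gid": [10, 10, 9],
    }
)

# One mask per dictionary, combined with OR:
mask_a = (df["simulation_id"] == 1) & (df["neuron_class"] == "L23_EXC") & df["gid"].isin([10])
mask_b = (df["simulation_id"] == 1) & (df["neuron_class"] == "L23_INH") & df["gid"].isin([9])
result = df[mask_a | mask_b]  # union of the two selections
```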
Comparison with plain Pandas¶
Let’s consider the following query:
df.etl.q(
    simulation_id=1,
    window=["w1", "w2"],
    gid={"ge": 8, "lt": 12},
    time={"lt": 2.1},
)
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
The same result can be obtained with plain Pandas, but it's more verbose and it's harder to chain the results, as shown in the example below.
df.loc[
    (df.index.get_level_values('simulation_id') == 1)
    & (df.index.get_level_values('window').isin(["w1", "w2"]))
    & (df['gid'] >= 8)
    & (df['gid'] < 12)
    & (df['time'] < 2.1)
]
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
As an alternative, it's possible to use the Pandas `query()` method as shown below, although this operation can be slower than the previous methods.
df.query("simulation_id==1 and window==['w1', 'w2'] and gid >= 8 and gid < 12 and time < 2.1")
| simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|
| 1 | 0 | L23_EXC | w1 | 0 | 2.011714 | 11 |
|  |  | L23_INH | w1 | 0 | 2.054338 | 8 |
When a query cannot be built with `etl.q`, it's still possible to use the plain Pandas syntax, since the returned objects are ordinary DataFrames or Series.
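For instance, a condition relating two columns to each other has no per-key equivalent, but can be expressed with a plain boolean mask. A sketch with a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"gid": [5, 9, 11, 14], "time": [2.1, 2.5, 2.0, 2.7]})

# A cross-column condition, not expressible as independent per-key filters:
result = df[df["time"] > df["gid"] / 5]
```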
Working with MultiIndexes¶
Retrieve the names of all the MultiIndex levels (conditions):
df.etl.conditions()
FrozenList(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'trial'])
Retrieve the difference between all the conditions and the specified conditions:
df.etl.complementary_conditions(['trial', 'simulation_id'])
FrozenList(['circuit_id', 'neuron_class', 'window'])
Retrieve the unique labels for each level:
df.etl.labels()
[Index([0, 1, 2], dtype='int64', name='simulation_id'),
Index([0], dtype='int64', name='circuit_id'),
Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class'),
Index(['w1'], dtype='object', name='window'),
Index([0], dtype='int64', name='trial')]
or for specific levels:
df.etl.labels(["neuron_class"])
[Index(['L23_INH', 'L23_EXC'], dtype='object', name='neuron_class')]
It's possible to remove one or more conditions. However, note that the resulting MultiIndex may no longer be unique.
df.etl.remove_conditions(['window', 'trial'])
| simulation_id | circuit_id | neuron_class | time | gid |
|---|---|---|---|---|
| 0 | 0 | L23_INH | 2.548814 | 6 |
|  |  | L23_INH | 2.715189 | 7 |
|  |  | L23_EXC | 2.602763 | 18 |
|  |  | L23_EXC | 2.544883 | 11 |
|  |  | L23_EXC | 2.423655 | 19 |
| ... | ... | ... | ... | ... |
| 2 | 0 | L23_INH | 2.490459 | 2 |
|  |  | L23_EXC | 2.227415 | 13 |
|  |  | L23_EXC | 2.254356 | 15 |
|  |  | L23_INH | 2.058029 | 4 |
|  |  | L23_INH | 2.434417 | 9 |
200 rows × 2 columns
It’s also possible to specify only the conditions to keep:
df.etl.keep_conditions(['window', 'trial'])
| window | trial | time | gid |
|---|---|---|---|
| w1 | 0 | 2.548814 | 6 |
|  | 0 | 2.715189 | 7 |
|  | 0 | 2.602763 | 18 |
|  | 0 | 2.544883 | 11 |
|  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... |
|  | 0 | 2.490459 | 2 |
|  | 0 | 2.227415 | 13 |
|  | 0 | 2.254356 | 15 |
|  | 0 | 2.058029 | 4 |
|  | 0 | 2.434417 | 9 |
200 rows × 2 columns
New levels can be added in the following way:
df.etl.add_conditions(conditions=['cond1', 'cond2'], values=[111, 222])
| cond1 | cond2 | simulation_id | circuit_id | neuron_class | window | trial | time | gid |
|---|---|---|---|---|---|---|---|---|
| 111 | 222 | 0 | 0 | L23_INH | w1 | 0 | 2.548814 | 6 |
|  |  |  |  |  |  | 0 | 2.715189 | 7 |
|  |  |  |  | L23_EXC | w1 | 0 | 2.602763 | 18 |
|  |  |  |  |  |  | 0 | 2.544883 | 11 |
|  |  |  |  |  |  | 0 | 2.423655 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
|  |  | 2 | 0 | L23_INH | w1 | 0 | 2.490459 | 2 |
|  |  |  |  | L23_EXC | w1 | 0 | 2.227415 | 13 |
|  |  |  |  |  |  | 0 | 2.254356 | 15 |
|  |  |  |  | L23_INH | w1 | 0 | 2.058029 | 4 |
|  |  |  |  |  |  | 0 | 2.434417 | 9 |
200 rows × 2 columns
Iteration¶
In general, iterations can be slow and should be avoided in favour of vectorized operations.
Iteration over DataFrames¶
The method `etl.iter()` can be used as an alternative to Pandas `itertuples()` when a namedtuple is desired for the index as well:
for index, values in df.etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
Values(time=2.5488135039273248, gid=6)
In some cases, it may be more convenient to work with dictionaries, accepting some performance penalty:
for index, values in df.etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
{'time': 2.5488135039273248, 'gid': 6}
For comparison, this is the result with Pandas `itertuples()`:
for item in df.itertuples():
    print(item)
    break
Pandas(Index=(0, 0, 'L23_INH', 'w1', 0), time=2.5488135039273248, gid=6)
The Pandas `iterrows()` method should be avoided because it can be much slower than the other methods (note also that it coerces each yielded row to a single dtype, as shown by the float `gid` below):
for index, values in df.iterrows():
    print(index)
    print(values)
    break
(0, 0, 'L23_INH', 'w1', 0)
time 2.548814
gid 6.000000
Name: (0, 0, L23_INH, w1, 0), dtype: float64
Iteration over Series¶
The `etl.iter()` method can also be used to iterate over Series; in this case, the yielded value is just the value of the item:
for index, values in df['time'].etl.iter():
    print(index)
    print(values)
    break
Index(simulation_id=0, circuit_id=0, neuron_class='L23_INH', window='w1', trial=0)
2.5488135039273248
Similarly, when used with Series, `etl.iterdict()` yields a dict for the index and the plain value for the item:
for index, values in df['time'].etl.iterdict():
    print(index)
    print(values)
    break
{'simulation_id': 0, 'circuit_id': 0, 'neuron_class': 'L23_INH', 'window': 'w1', 'trial': 0}
2.5488135039273248
Grouping¶
In some cases it can be useful to use `etl.groupby_except()` to group by all the conditions except the specified ones:
df.etl.groupby_except(['neuron_class', 'window', 'trial']).count()
| simulation_id | circuit_id | time | gid |
|---|---|---|---|
| 0 | 0 | 100 | 100 |
| 1 | 0 | 75 | 75 |
| 2 | 0 | 25 | 25 |
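In plain Pandas, the same kind of grouping can be mimicked by computing the complementary index levels manually; a minimal sketch with a small made-up DataFrame (not blueetl's implementation):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 0, 1], "b": [0, 1, 0], "x": [1.0, 2.0, 3.0]}
).set_index(["a", "b"])

excluded = ["b"]
# Group by every index level except the excluded ones:
keep = [name for name in df.index.names if name not in excluded]
result = df.groupby(keep).count()
```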
It's then possible to apply a user-defined function to the result:
from scipy.stats import entropy

def response_entropy(x):
    return entropy(x, base=2) / np.log2(len(x))
df['time'].etl.groupby_except(['neuron_class', 'window', 'trial']).apply(response_entropy)
simulation_id circuit_id
0 0 0.998522
1 0 0.998623
2 0 0.998177
Name: time, dtype: float64
The method `etl.groupby_iter()` can be used to get a generator yielding each record of the grouped DataFrame as a tuple `(key, df)`.
In general, `etl.groupby_iter()` is not a drop-in replacement for `groupby()`, because it returns a generator, while `groupby()` returns a `DataFrameGroupBy`.
However, it can be used as a replacement for iterating over `groupby()`, with these differences:
- the yielded keys are namedtuples instead of tuples
- it's possible to specify the list of column names to be included in the yielded DataFrames
Iterating over `df`:
for key, tmp_df in df.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
time gid
simulation_id circuit_id neuron_class window trial
0 0 L23_EXC w1 0 2.832620 10
0 2.568434 10
0 2.438602 10
0 2.368725 10
0 2.976459 10
For comparison, with Pandas:
for key, tmp_df in df.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
time gid
simulation_id circuit_id neuron_class window trial
0 0 L23_EXC w1 0 2.832620 10
0 2.568434 10
0 2.438602 10
0 2.368725 10
0 2.976459 10
Iterating over `df_col` (columnar representation):
for key, tmp_df in df_col.etl.groupby_iter(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC', window='w1', gid=10)
simulation_id circuit_id neuron_class window trial time gid
17 0 0 L23_EXC w1 0 2.832620 10
33 0 0 L23_EXC w1 0 2.568434 10
51 0 0 L23_EXC w1 0 2.438602 10
65 0 0 L23_EXC w1 0 2.368725 10
70 0 0 L23_EXC w1 0 2.976459 10
For comparison, with Pandas:
for key, tmp_df in df_col.groupby(['simulation_id', 'circuit_id', 'neuron_class', 'window', 'gid']):
    print(key)
    print(tmp_df)
    break
(0, 0, 'L23_EXC', 'w1', 10)
simulation_id circuit_id neuron_class window trial time gid
17 0 0 L23_EXC w1 0 2.832620 10
33 0 0 L23_EXC w1 0 2.568434 10
51 0 0 L23_EXC w1 0 2.438602 10
65 0 0 L23_EXC w1 0 2.368725 10
70 0 0 L23_EXC w1 0 2.976459 10
Multiprocessing¶
With `etl.groupby_run_parallel()` it's possible to group the DataFrame and run a function in parallel processes.
The function should accept `(key, df)` and can return anything.
This method should be used only when the function is expensive enough to be worth executing in a separate subprocess. If the function isn't expensive, iterating over `etl.groupby_iter()` and calling the function directly is probably faster.
In the following example, the function is executed in separate processes and simply returns the grouping key and the length of the partial DataFrame.
def func(key, df):
    return key, len(df)
df.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]
The same result is returned when `etl.groupby_run_parallel()` is applied to the columnar `df_col`:
def func(key, df):
    return key, len(df)
df_col.etl.groupby_run_parallel(['simulation_id', 'circuit_id', 'neuron_class'], func=func)
[(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_EXC'), 52),
(RecordKey(simulation_id=0, circuit_id=0, neuron_class='L23_INH'), 48),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_EXC'), 28),
(RecordKey(simulation_id=1, circuit_id=0, neuron_class='L23_INH'), 47),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_EXC'), 12),
(RecordKey(simulation_id=2, circuit_id=0, neuron_class='L23_INH'), 13)]
If the function returns a DataFrame and all the results should be concatenated into a single DataFrame, it's possible to use `etl.groupby_apply_parallel()`.
This is just a shortcut for calling `etl.groupby_run_parallel()` and concatenating the results with `pd.concat()`.
Note that this method should still be considered experimental and subject to change.
def func(key, df):
    return pd.DataFrame(
        data={"mean": [df['time'].mean()]},
        index=pd.MultiIndex.from_tuples([key], names=key._fields),
    )
df_col.etl.groupby_apply_parallel(['simulation_id', 'circuit_id', 'window'], func=func)
| simulation_id | circuit_id | window | mean |
|---|---|---|---|
| 0 | 0 | w1 | 2.472794 |
| 1 | 0 | w1 | 2.550862 |
| 2 | 0 | w1 | 2.459742 |
In simple cases like the previous example, it may be more efficient to avoid multiprocessing altogether:
df_col.groupby(['simulation_id', 'circuit_id', 'window'])[['time']].mean()
| simulation_id | circuit_id | window | time |
|---|---|---|---|
| 0 | 0 | w1 | 2.472794 |
| 1 | 0 | w1 | 2.550862 |
| 2 | 0 | w1 | 2.459742 |