Data Processing
DataProcessing
Source code in bioepic_skills/data_processing.py
convert_to_df
Convert a list of dictionaries to a pandas dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `list` | A list of dictionaries. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas dataframe. |
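
The conversion described above can be illustrated with plain pandas. This is a minimal sketch, not the source of `convert_to_df`; the record fields (`id`, `name`, `depth`) are made up for illustration.

```python
import pandas as pd

# Hypothetical records; the field names are invented for this example.
records = [
    {"id": "s1", "name": "sample one"},
    {"id": "s2", "name": "sample two", "depth": 3.5},
]

# pandas builds one row per dictionary and one column per key.
# A key that is missing from a record becomes NaN in that row.
df = pd.DataFrame(records)
print(df)
```

Records do not need identical keys; the resulting dataframe carries the union of all keys as columns.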
split_list
Split a list into chunks of a specified size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_list` | `list` | The list to split. | required |
| `chunk_size` | `int` | The size of each chunk. Default is 100. | `100` |
Returns:
| Type | Description |
|---|---|
| `list` | A list of lists, where each inner list is a chunk of the original list. |
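
A common way to get this behaviour is to slice the list in fixed steps. The function below is a hypothetical stand-in named `split_list_sketch`, not the library source, and it assumes the final chunk is simply shorter when the list length is not a multiple of `chunk_size`.

```python
def split_list_sketch(input_list, chunk_size=100):
    """Hypothetical stand-in for the documented behaviour."""
    # Step through the list in strides of chunk_size and slice out each chunk.
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


chunks = split_list_sketch(list(range(7)), chunk_size=3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Chunking like this is typically useful when a long list of identifiers has to be sent to an API in several smaller requests.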
rename_columns
Rename columns in a pandas dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The pandas dataframe whose columns will be renamed. | required |
| `new_col_names` | `list` | A list of new column names. Names MUST be in the same order as the columns in the dataframe. Example: if the current column names are `['old_col1', 'old_col2', 'old_col3']`, pass the new names as `['new_col1', 'new_col2', 'new_col3']`. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas dataframe with renamed columns. |
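
Positional renaming of this kind can be sketched by assigning a list to the dataframe's `columns` attribute. This is an illustration in plain pandas, not the source of `rename_columns`; working on a copy is a choice made here and may not match the library.

```python
import pandas as pd

df = pd.DataFrame({"old_col1": [1], "old_col2": [2], "old_col3": [3]})
new_col_names = ["new_col1", "new_col2", "new_col3"]

# Names are applied by position, so the list order must match the column order.
# pandas raises ValueError if the list length differs from the column count.
renamed = df.copy()
renamed.columns = new_col_names
print(list(renamed.columns))  # ['new_col1', 'new_col2', 'new_col3']
```

Because the mapping is purely positional, a reordered list silently mislabels columns, which is why the order requirement above matters.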
merge_dataframes
Merge two dataframes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | The column to merge on. | required |
| `df1` | `DataFrame` | The first dataframe to merge. | required |
| `df2` | `DataFrame` | The second dataframe to merge. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas dataframe with the merged data. |
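
Merging on a single shared column can be sketched with `pandas.merge`. The join type used by `merge_dataframes` is not stated here, so this example assumes pandas' default inner join; the data and column names are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"id": ["b", "c", "d"], "y": [20, 30, 40]})

# Assumption: an inner join (the pandas default), which keeps only the
# rows whose "id" value appears in both dataframes.
merged = pd.merge(df1, df2, on="id")
print(merged)
```

With an inner join, rows `a` and `d` are dropped because each appears in only one input.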
merge_df
Merge new results with the previous results that were used for the new API request. Rows are matched on two keys, one from each dataframe: key1 in df1 and key2 in df2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df1` | `DataFrame` | The first dataframe to merge. | required |
| `df2` | `DataFrame` | The second dataframe to merge. | required |
| `key1` | `str` | The key in df1 to match with key2 in df2. | required |
| `key2` | `str` | The key in df2 to match with key1 in df1. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas dataframe with the merged data. |
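
When the matching column has a different name in each dataframe, `pandas.merge` accepts `left_on` and `right_on`. The sketch below shows that pattern only; it is not the source of `merge_df`, the join type is assumed to be pandas' default inner join, and the column names (`biosample_id`, `associated_id`) are hypothetical.

```python
import pandas as pd

# Hypothetical "previous" results that supplied the ids for a follow-up request.
previous = pd.DataFrame({"biosample_id": ["b1", "b2"], "site": ["north", "south"]})

# Hypothetical "new" results that refer back to those ids under another name.
new = pd.DataFrame({"associated_id": ["b2", "b1", "b1"], "value": [7, 8, 9]})

# key1 plays the role of left_on and key2 the role of right_on.
merged = pd.merge(previous, new, left_on="biosample_id", right_on="associated_id")
print(merged)
```

Both key columns are kept in the output, and a row in one dataframe is repeated once for every matching row in the other.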
build_filter
Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `attributes` | `dict` | Dictionary of attribute names and their corresponding values to match using regex. Example: `{"name": "example", "description": "example", "geo_loc_name": "example"}` | required |
| `exact_match` | `bool` | Whether each attribute value must match exactly or may match partially. Default is False, meaning a partial match is accepted. Under the hood, this determines whether the attribute value is wrapped in a regex expression. | `False` |
Returns:
| Type | Description |
|---|---|
| `dict` | A dictionary representing the MongoDB filter. Example: `{"name": {"$regex": "example", "$options": "i"}, "description": {"$regex": "example", "$options": "i"}}` |
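
The filter shape in the example above can be reproduced with a short loop. `build_filter_sketch` is a hypothetical stand-in, not the library source. The partial-match branch follows the documented output; the exact-match branch (a plain equality value with no regex wrapper) is an assumption based on the parameter description.

```python
def build_filter_sketch(attributes, exact_match=False):
    """Hypothetical stand-in for the documented behaviour."""
    mongo_filter = {}
    for name, value in attributes.items():
        if exact_match:
            # Assumption: an exact match is expressed as plain equality.
            mongo_filter[name] = value
        else:
            # Case-insensitive partial match, as in the documented example.
            mongo_filter[name] = {"$regex": value, "$options": "i"}
    return mongo_filter


# Nested attributes are addressed with dot notation in the key itself.
partial = build_filter_sketch({"name": "example", "env.depth": "10"})
exact = build_filter_sketch({"name": "example"}, exact_match=True)
print(partial)
print(exact)  # {'name': 'example'}
```

Values are passed to `$regex` unchanged in this sketch, so characters such as `.` or `(` keep their regex meaning.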
extract_field
Extract a specific field from a list of API results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `api_results` | `list` | A list of dictionaries representing API results. | required |
| `field_name` | `str` | The name of the field to extract. | required |
Returns:
| Type | Description |
|---|---|
| `list` | A list of values for the specified field. |
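
Pulling one field out of a list of result dictionaries is a single list comprehension. `extract_field_sketch` is a hypothetical stand-in, not the library source; skipping records that lack the field is an assumption made here, and the sample records are invented.

```python
def extract_field_sketch(api_results, field_name):
    """Hypothetical stand-in for the documented behaviour."""
    # Assumption: records without the requested field are skipped.
    return [record[field_name] for record in api_results if field_name in record]


# Invented API results for illustration.
results = [
    {"id": "s1", "name": "sample one"},
    {"id": "s2", "name": "sample two"},
    {"name": "no id here"},
]

ids = extract_field_sketch(results, "id")
print(ids)  # ['s1', 's2']
```

A list of identifiers extracted this way is the usual input for a follow-up filter or a chunked request.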